MOHIT SHARMA

Software Engineer

Bengaluru, Karnataka, India · 9 yrs 5 mos experience

Key Highlights

  • Expert in building robust internal developer platforms.
  • Proficient in chaos engineering and automated recovery.
  • Strong background in cloud cost management and observability.
Stackforce AI infers this person is a Cloud Infrastructure and Site Reliability Engineering expert in the SaaS industry.

Skills

Core Skills

Cloud Infrastructure, Chaos Engineering, Internal Developer Platform, Synthetic Data Generation, Service Mesh, Kubernetes, Site Reliability Engineering, Disaster Recovery, Cost Management, Release Automation, Backend Development

Other Skills

IDP, Internal Process Development, Cloud Cost Observability, Automated Recovery Patterns, Python, Django, OpenTelemetry, FinOps, RESTful APIs, Django REST Framework, GitOps, Observability, Incident Management, Automation, Ansible

About

I build systems that make engineers faster, safer, and more efficient. With over 9 years of experience in high-scale environments, including CoinDCX, Wayfair, Walmart, British Telecom, Deloitte, and TCS, I specialize in moving beyond simply "keeping the lights on" to building platforms that drive true engineering excellence. My work is focused on creating robust, self-service infrastructure that eliminates operational friction and allows engineering teams to ship with confidence.

Across these diverse organizations, I have prioritized building custom platform solutions that move the needle. This includes designing internal developer portals that abstract underlying complexity, engineering in-house observability suites that provide granular, context-aware insights, and implementing comprehensive resilience frameworks. By integrating chaos engineering and automated recovery patterns, I ensure that systems remain stable under pressure. I also take a disciplined approach to FinOps, utilizing tools like Amnic and custom dashboards to turn infrastructure spend into an optimized, predictable utility that supports growth rather than hindering it.

My approach is grounded in deep technical engineering rather than just tool configuration. I focus on building Kubernetes-native controls using Custom Resource Definitions and Admission Controllers, automating certificate lifecycles, and managing service mesh configurations with Istio. Whether it is provisioning ARM64-based edge devices or building synthetic data tooling for ephemeral environments, my goal is always to reduce toil. I believe that a great platform is one that empowers developers to own their code from local development all the way to production.

Beyond my core expertise in Kubernetes, Terraform, Ansible, and cloud ecosystems like Azure and GCP, I am a firm believer in SRE principles. I actively define and measure SLIs, SLOs, and error budgets, using post-mortem analysis to turn incidents into structural improvements. I am currently focused on the intersection of platform security and developer velocity, and I am always looking for new ways to solve complex architectural problems through clean, scalable, and automated design.

Experience

Total Experience: 9 yrs 5 mos

CoinDCX

2 roles

Staff Software Engineer

Nov 2025 - Present · 7 mos

  • Crafting internal developer portals to streamline workflows and improve the developer experience.
  • Architecting custom, in-house observability tools to drive deep insights into distributed system performance.
  • Engineering a comprehensive resilience framework, integrating chaos engineering principles and automated recovery patterns to ensure mission-critical uptime and system fault tolerance.
  • Led the adoption of Amnic to establish a cloud cost observability framework, providing granular visibility into multi-cloud and Kubernetes spend.
IDP, Cloud Infrastructure, Internal Process Development, Cloud Cost Observability, Chaos Engineering, Automated Recovery Patterns

Platform Engineer

Aug 2024 - Present · 1 yr 10 mos

  • Architecting an Internal Developer Platform (IDP) using Python/Django and Backstage, improving Developer Experience (DevEx) for 100+ engineers and reducing service onboarding time by ~60%.
  • Built custom in-house OpenTelemetry-based observability tooling in Go, surfacing deep performance insights across 30+ distributed microservices — replacing a $40K/yr third-party solution.
  • Engineered a resilience framework with chaos engineering (fault injection, automated recovery patterns), achieving 99.95% uptime SLO for mission-critical trading infrastructure.
  • Led adoption of Amnic for FinOps and cloud cost observability, delivering granular multi-cloud and Kubernetes spend visibility and driving 25% reduction in cloud waste.
Python, Django, OpenTelemetry, Chaos Engineering, FinOps, Kubernetes, +1 more

Wayfair

2 roles

Senior Software Engineer

Aug 2024 - Nov 2025 · 1 yr 3 mos

  • Enhanced test data quality and ensured privacy compliance by integrating Tonic.ai into pre-production environments for synthetic data generation.
  • Automated creation of ephemeral environments by integrating Garden, streamlining development processes.
  • Built an internal synthetic data generation tool to support development and testing without exposing sensitive data.
  • Developed RESTful APIs using Python (Django REST Framework) to manage ephemeral environments with Garden, reducing environment setup time by 70%.
  • Strengthened security and observability by implementing Istio Service Mesh across environments.
  • Improved platform security and governance by designing and implementing Custom Resource Definitions (CRDs) and Admission Controllers in Kubernetes.
Synthetic Data Generation, Service Mesh, RESTful APIs, Python, Django REST Framework

Platform Engineer

Aug 2024 - Nov 2025 · 1 yr 3 mos

  • Built RESTful APIs (Python, Django REST Framework) to manage ephemeral dev environments via Garden, reducing environment setup time by 70% and improving Developer Experience for 50+ engineers.
  • Developed an internal synthetic data generation service integrating Tonic.ai, cutting pre-production data provisioning time by 80% while ensuring full privacy compliance.
  • Designed and implemented Kubernetes Admission Controllers and CRDs to enforce platform security policies, blocking 100% of non-compliant workloads at admission time.
  • Deployed Istio Service Mesh with GitOps-driven config management (ArgoCD), improving inter-service observability and enforcing mTLS security across all environments.
Python, Django REST Framework, Kubernetes, Service Mesh, GitOps

Walmart Global Tech India

2 roles

Senior Site Reliability Engineer - Platforms

May 2023 - Feb 2025 · 1 yr 9 mos · Bangalore Urban, Karnataka, India · Hybrid

Site Reliability Engineer - Platforms

Jun 2021 - May 2023 · 1 yr 11 mos · Bangalore Urban, Karnataka, India · Hybrid

  • Designed and implemented comprehensive observability strategies covering real-time monitoring, log analysis, and performance optimization.
  • Established incident management procedures and post-mortem review processes, contributing to a reduced mean time to recovery (MTTR) and driving continuous improvement in incident response.
  • Created automation for onboarding ARM64 devices (AGX) to new Walmart stores, including setting up the Salt master and minions, using Jenkins, Ansible, etc.
  • Created certificate renewal automation to reduce the time taken to renew certificates.
  • Collaborated with cross-functional teams to establish and meet Service Level Objectives (SLOs) and Service Level Indicators (SLIs) covering the availability, latency, and error rates of critical services.
  • Created a One Click Disaster Recovery process using Python and Ansible.
  • Developed Python code for attaching host GPUs to virtual machines.
  • Created a Grafana dashboard of cluster resource usage and performance metrics that software engineers use to troubleshoot problems with their infrastructure.
  • Created REST APIs for CRUD operations using Django REST Framework.
  • Created automation for spinning up new VMs on ESXi hosts using Ansible and Jenkins.
  • Set up and managed the Voxel51 (Teams51) machine learning platform to run data science models using Terraform, Jenkins, etc.
  • Created a pipeline for managing the SDLC of ARM64 devices (AGX) using Docker, Salt, Python, Ansible, etc.
  • Served on call to support production infrastructure.
  • Created an application portal using HTML/CSS, JavaScript, and Django, used by teams across Walmart.
Observability, Incident Management, Automation, Ansible, Python, Site Reliability Engineering

Walmart

Platform Engineer

Jun 2021 - Aug 2024 · 3 yrs 2 mos

  • Architected a One-Click Disaster Recovery Platform (Python, Django REST) automating multi-region failover and validation — reducing DR time by 80% across 200+ services.
  • Built Cost Insights API (Go + PostgreSQL) aggregating GCP and Azure billing data, enabling FinOps governance and delivering team-level cost visibility to 15+ engineering teams.
  • Developed certificate lifecycle automation services (Go, Jenkins, Vault, Ansible), eliminating manual cert operations and reducing cert-related incidents to zero.
  • Designed OpenTelemetry-compatible observability frameworks across 100+ services (Prometheus, Grafana), reducing MTTD by 30% and improving SLI/SLO adherence.
  • Standardized incident post-mortem workflows and SLO/SLI alerting across 100+ services, cutting recovery time by 20% and reducing repeat incidents by 35%.
Disaster Recovery, Cost Insights API, Certificate Lifecycle Automation, Observability Frameworks, Cost Management

Deloitte

2 roles

Site Reliability Engineer - Platforms

Apr 2020 - Mar 2021 · 11 mos

  • Worked with the application team to migrate their monolithic applications to a microservice architecture using Docker, Kubernetes, Jenkins, etc.
  • Collaborated with engineering and cross-functional teams to establish and meet Service Level Objectives (SLOs) and Service Level Indicators (SLIs) covering the availability, latency, and error rates of critical services, aligned with business objectives and end-user expectations.
  • Established incident management procedures and post-mortem review processes, contributing to a reduced mean time to recovery (MTTR) and driving continuous improvement in incident response.
  • Developed a tool to automate the creation of Kubernetes clusters using Ansible and Python, and integrated it into a Jenkins CI/CD pipeline.
  • Created dashboards for measuring disk utilization, server health, and other critical data to reduce on-call overhead.
  • Created Grafana dashboards for monitoring incoming traffic to clusters, Pod memory utilization, etc.
  • Provided L3/L4 support for production troubleshooting and incident management.
Microservice Architecture, Incident Management, Kubernetes, Ansible, Python, Site Reliability Engineering

Platform Engineer

Apr 2020 - Mar 2021 · 11 mos

  • Built Python microservices and APIs for resource tracking and capacity planning across 50+ hybrid (cloud + on-prem) projects.
  • Automated Kubernetes cluster lifecycle management (Python, Ansible, Jenkins), cutting cluster setup time by 65% and migrating on-prem workloads to containerised environments.
  • Developed Prometheus + Grafana dashboards for real-time cluster monitoring; automated incident recovery playbooks reducing MTTR by 40%.
CI/CD, Automation, Test Plans, Release Automation

British Telecom Global Services India Limited

Site Reliability Engineer - Platforms

Mar 2019 - Apr 2020 · 1 yr 1 mo · Bangalore

  • Created build and release pipelines to implement CI/CD practices.
  • Migrated all manual deployments to CI/CD using Git, Jenkins, etc.
  • Enhanced the efficiency of legacy systems by identifying redundant processes and converting 50+ manual tasks to automated processes.
  • Designed 10+ test plans for software releases, which identified client-side regressions before deployment.
Python, Flask, CI/CD, Release Automation

British Telecom

Platform Engineer

Mar 2019 - Apr 2020 · 1 yr 1 mo

  • Implemented a centralised release automation API (Python, Flask, Jenkins) that reduced manual deployment effort by 60%, enabling 3x faster release cycles.
  • Migrated manual deployments to fully automated CI/CD pipelines (Git, Jenkins, Python); automated 50+ legacy operational tasks, saving 20+ engineering hours per week.

Tata Consultancy Services

3 roles

System Engineer

Oct 2018 - Mar 2019 · 5 mos

Software Engineer

Oct 2016 - Mar 2019 · 2 yrs 5 mos

  • Developed backend automation APIs (Python, Django) for deployment orchestration; containerised Java, Ruby, and Python apps with Docker + Jenkins CI/CD pipelines.
  • Created Ansible playbooks for vSphere VM provisioning; built ELK + Grafana observability dashboards, reducing incident triage time by 50%.

Assistant System Engineer

Oct 2016 - Oct 2018 · 2 yrs

Python, Django, Docker, Ansible, Backend Development

Education

Rajiv Gandhi Prodyogiki Vishwavidyalaya

Bachelor of Engineering - BE

Aug 2012 - Jun 2016

Gyan Ganga Institute of Technology Sciences

Bachelor of Engineering
