Sumit Thakur

Director of Engineering

Bengaluru, Karnataka, India · 10 yrs 10 mos experience
Highly Stable

Key Highlights

  • Led high-performing SRE teams to enhance system reliability.
  • Implemented automation reducing operational tasks significantly.
  • Achieved measurable cost savings through cloud optimization.
Stackforce AI infers this person is a SaaS Infrastructure Engineer with strong expertise in cloud technologies and incident management.

Skills

Core Skills

Cloud Infrastructure · Incident Management · Security · Automation · Infrastructure Management · Software Engineering Practices

Other Skills

AWS · AWS CloudFormation · AWS Command Line Interface (CLI) · AWS Shield Advanced · Amazon Athena · Amazon CloudFront · Amazon EC2 · Amazon Kinesis · Amazon S3 · Amazon Web Services (AWS) · Apache Velocity · ArgoCD · Bash · C · C++

About

Lead Site Reliability Engineer with experience in designing, implementing, and managing scalable and reliable infrastructure. Adept at leading cross-functional teams and implementing automation for seamless operations. Skilled in cloud technologies, orchestration tools, SRE practices, and incident management.

  • Leadership: Led a team of SREs in designing, deploying, and maintaining critical infrastructure components.
  • Security: Implemented multiple security features for cloud-based infrastructure, such as Firewall and Shield.
  • Hybrid Cloud Architecture: Contributed to the design and implementation of a high-availability, fault-tolerant hybrid cloud architecture.
  • Automation: Automated routine operational tasks, reducing manual intervention and improving system efficiency.
  • CI/CD: Migrated and optimized CI/CD pipelines, reducing deployment times by 30% and enhancing release reliability.
  • Monitoring & Alerting: Developed and maintained systems to proactively identify and address potential issues.
  • Collaboration: Worked with development teams to optimize application performance, scalability, and reliability.
  • Incident Management: Conducted incident reviews and root cause analysis, and implemented preventive measures to enhance system resilience, ensuring timely incident response and minimizing downtime and MTTR.

Experience

Total Experience: 10 yrs 10 mos
Average Tenure: 5 yrs 1 mo
Current Experience: 10 mos

Myntra

Senior Engineering Manager

Aug 2025 - Present · 10 mos · Bengaluru, Karnataka, India · On-site

Khoros

6 roles

Site Reliability Engineering Manager

Promoted

May 2023 - Nov 2025 · 2 yrs 6 mos

  • Lead and mentor a high-performing SRE team, driving a culture of ownership, collaboration, and continuous improvement.
  • Define and uphold SLOs/SLIs to ensure high availability and performance for critical services.
  • Oversee cloud infrastructure (AWS, EKS) with a focus on scalability, reliability, and automation via Terraform, Helm, and ArgoCD.
  • Streamline incident management processes, resulting in a significant reduction in MTTR.
  • Drive observability practices using Datadog and Sumo Logic, optimizing alerting and monitoring strategies.
  • Enforce infrastructure and application security using mTLS, RBAC, and WAF rules.
  • Collaborate cross-functionally with Dev, Product, and Support teams for production readiness and smooth deployments.
  • Lead monthly reviews of cloud spend and implement DynamoDB cost optimization strategies.
Highlights:
  • Reduced noisy alerts by 40% by refining monitoring thresholds and custom alerting logic.
  • Built internal tools to streamline incident triaging and reduce MTTR.
  • Implemented 90/10 canary deployments for safer rollouts and quick rollback capability.
  • Achieved measurable AWS/EKS cost savings by optimizing pod resource requests and auto-scaling behavior.
  • Successfully onboarded new SREs with structured access, mentorship, and documentation support.
AWS · EKS · Terraform · Helm · ArgoCD · Datadog (+6 more)

Lead Site Reliability Engineer

May 2022 - May 2023 · 1 yr

  • Engineered an automated Web Application Firewall (WAF) solution utilizing AWS and GitHub Actions to secure load balancers and API Gateways.
  • Successfully integrated AWS Shield Advanced at the CDN level to mitigate DDoS attacks.
  • Developed a robust internal tool for Site Reliability Engineering (SRE) debugging, resulting in significant improvements in Mean Time to Recovery (MTTR).
AWS · GitHub Actions · Web Application Firewall · AWS Shield Advanced · Security · Automation

Senior SRE

Promoted

Mar 2021 - May 2022 · 1 yr 2 mos

  • Migrated existing Datadog monitors from the Datadog UI to Terraform, enabling infrastructure-as-code for better manageability and scalability.
  • Onboarded multiple new applications to the SRE framework, enhancing operational efficiency and reliability.
Terraform · Datadog · AWS · Microservices · Incident Response · Infrastructure Management (+1 more)

SRE-3

Promoted

Oct 2019 - Mar 2021 · 1 yr 5 mos

AWS · Git · CI/CD · Shell Scripting · Software Engineering Practices

SRE-2

Promoted

Oct 2018 - Oct 2019 · 1 yr

CI/CD · Shell Scripting

SDET-2

Feb 2018 - Oct 2018 · 8 mos

Git · Software Engineering Practices

Sprinklr

SDET

Jul 2015 - Jan 2018 · 2 yrs 6 mos

Git · Software Engineering Practices

Education

SJCE

Engineer’s Degree — Information Science

Jan 2011 - Jan 2015
