Sumit Thakur

Director of Engineering

Bengaluru, Karnataka, India · 10 yrs 10 mos experience

Highly Stable

Key Highlights

Led high-performing SRE teams to enhance system reliability.
Implemented automation reducing operational tasks significantly.
Achieved measurable cost savings through cloud optimization.

Stackforce AI infers this person is a SaaS Infrastructure Engineer with strong expertise in cloud technologies and incident management.

Skills

Core Skills

Cloud Infrastructure, Incident Management, Security, Automation, Infrastructure Management, Software Engineering Practices

Other Skills

AWS, AWS CloudFormation, AWS Command Line Interface (CLI), AWS Shield Advanced, Amazon Athena, Amazon CloudFront, Amazon EC2, Amazon Kinesis, Amazon S3, Amazon Web Services (AWS), Apache Velocity, ArgoCD, Bash, C, C++

About

Lead Site Reliability Engineer with experience in designing, implementing, and managing scalable and reliable infrastructure. Adept at leading cross-functional teams and implementing automation for seamless operations. Skilled in cloud technologies, orchestration tools, SRE practices, and incident management.
- Leadership: Led a team of SREs in designing, deploying, and maintaining critical infrastructure components.
- Security: Implemented multiple security features for cloud-based infrastructure, such as Firewall and Shield.
- Hybrid Cloud Architecture: Contributed to the design and implementation of a high-availability, fault-tolerant hybrid cloud architecture.
- Automation: Automated routine operational tasks, reducing manual intervention and improving system efficiency.
- CI/CD: Migrated and optimized CI/CD pipelines, reducing deployment times by 30% and enhancing release reliability.
- Monitoring & Alerting: Developed and maintained systems to proactively identify and address potential issues.
- Collaboration: Worked with development teams to optimize application performance, scalability, and reliability.
- Incident Management: Conducted incident reviews and root cause analysis, and implemented preventive measures to enhance system resilience, ensuring timely incident response and minimizing downtime and MTTR.

Experience

Total Experience: 10 yrs 10 mos
Average Tenure: 5 yrs 1 mo
Current Experience: 10 mos

Myntra

Senior Engineering Manager

Aug 2025 – Present · 10 mos · Bengaluru, Karnataka, India · On-site

Khoros

6 roles

Site Reliability Engineering Manager

Promoted

May 2023 – Nov 2025 · 2 yrs 6 mos

Lead and mentor a high-performing SRE team, driving a culture of ownership, collaboration, and continuous improvement.
Define and uphold SLOs/SLIs to ensure high availability and performance for critical services.
Oversee cloud infrastructure (AWS, EKS) with a focus on scalability, reliability, and automation via Terraform, Helm, and ArgoCD.
Streamline incident management processes, resulting in a significant reduction in MTTR.
Drive observability practices using Datadog and Sumo Logic, optimizing alerting and monitoring strategies.
Enforce infrastructure and application security using mTLS, RBAC, and WAF rules.
Collaborate cross-functionally with Dev, Product, and Support teams for production readiness and smooth deployments.
Lead monthly reviews of cloud spend and implement DynamoDB cost optimization strategies.
Highlights:
Reduced noisy alerts by 40% by refining monitoring thresholds and custom alerting logic.
Built internal tools to streamline incident triaging and reduce MTTR.
Implemented 90/10 canary deployments for safer rollouts and quick rollback capability.
Achieved measurable AWS/EKS cost savings by optimizing pod resource requests and auto-scaling behavior.
Successfully onboarded new SREs with structured access, mentorship, and documentation support.

AWS, EKS, Terraform, Helm, ArgoCD, Datadog, +6 more

Lead Site Reliability Engineer

May 2022 – May 2023 · 1 yr

Engineered an automated Web Application Firewall (WAF) solution utilizing AWS and GitHub Actions to secure load balancers and API Gateways.
Successfully integrated AWS Shield Advanced at the CDN level to mitigate DDoS attacks.
Developed a robust internal tool for Site Reliability Engineering (SRE) debugging, resulting in significant improvements in Mean Time to Recovery (MTTR).

AWS, GitHub Actions, Web Application Firewall, AWS Shield Advanced, Security, Automation

Senior SRE

Promoted

Mar 2021 – May 2022 · 1 yr 2 mos

Migrated existing Datadog monitors from the Datadog UI to Terraform, enabling infrastructure-as-code for better manageability and scalability.
Onboarded multiple new applications to the SRE framework, enhancing operational efficiency and reliability.

Terraform, Datadog, AWS, Microservices, Incident Response, Infrastructure Management, +1 more