Anant Kumar

SRE (Site Reliability Engineer)

Hyderabad, Telangana, India · 13 yrs 2 mos experience
Highly Stable

Key Highlights

  • Expert in building and leading SRE teams.
  • Proven track record in multi-cloud CI/CD pipeline development.
  • Strong focus on incident management and operational excellence.
Stackforce AI infers this person is a DevOps and Site Reliability Engineering expert in the SaaS industry.

Skills

Core Skills

DevOps · Site Reliability Engineering · Continuous Integration and Continuous Delivery (CI/CD) · Infrastructure Automation

Other Skills

AWS · Ansible · Bash · Chaos Engineering · Cloud Computing IaaS · Cloud Development · Continuous Integration (CI) · Control-M · Datadog · Docker · ELK · Financial Operations · Git · GitLab · GitHub

About

I am straightforward, hardworking, and sincere in all my assignments. I bring a "can-do" approach to every task. I like challenging jobs and always put forth my best effort to achieve the goal.

Experience

Lloyds Technology Centre India

DevSecOps Engineering (Grade E)

Jan 2025 - Present · 1 yr 2 mos · Hyderabad, Telangana, India · Hybrid

  • Led and mentored a team of Site Reliability Engineers, fostering skill development, reliability-focused culture, and operational excellence.
  • Designed, built, and maintained multi-cloud CI/CD pipelines using GitHub Actions for deployments across AWS, GCP, and Azure.
  • Established robust incident response frameworks, driving root cause analysis (RCA) governance and continuous post-incident improvement processes.
  • Developed OS-level guardrails and implemented cloud security policies leveraging Wiz, enhancing compliance and security posture across environments.
  • Automated provisioning and configuration of multi-cloud infrastructure using Terraform, improving scalability and reducing manual effort.
  • Managed and optimized Kubernetes workloads to ensure high availability, compliance, and cost efficiency across clusters.
GitHub · Terraform · Google Cloud Platform (GCP) · Microsoft Azure · AWS · Continuous Integration and Continuous Delivery (CI/CD) · +2

Demandbase

Principal Engineer (SRE/DevOps/Platform)

Sep 2021 - Jan 2025 · 3 yrs 4 mos · Hyderabad, Telangana, India · Remote

  • Led cloud infrastructure strategy and delivery: defined roadmaps, set OKRs, and drove cross-team alignment to build secure, scalable, and automated platforms.
  • Designed, built, and maintained CI/CD pipelines for multiple microservices across multi-cloud environments (AWS, GCP), ensuring scalability and reliability.
  • Implemented cloud cost optimization strategies across AWS and GCP, reducing overall infrastructure spend by ~25% through rightsizing, workload scheduling, and monitoring unused resources.
  • Built an automated AWS and GCP provisioning platform through Terraform IaC and GitLab, reducing setup time by 70% and improving governance.
  • Designed and operated AWS EKS and GCP GKE Kubernetes clusters, improving deployment speed by 40%.
  • Implemented a unified observability stack (Prometheus, Grafana, Datadog), reducing incident detection time by 30%.
  • Established SLIs (Latency, Errors, Traffic, Saturation) and SLOs with Error Budgets across key microservices, driving proactive monitoring and reliability improvements.
  • Led incident management and postmortem processes for critical production services, reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) by 30% through improved alerting and automation.
  • Designed and rolled out GitOps pipelines with FluxCD to automate Kubernetes deployments, ensuring version-controlled, auditable, and policy-compliant delivery processes.
  • Automated key operational tasks using Python and Bash, reducing manual toil by 40% and improving team efficiency and response times.
  • Drove the adoption of a Shift Left approach by integrating developer productivity and automation tools, improving code quality and accelerating feedback cycles across development teams.
  • Spearheaded PoC and PoV evaluations of emerging cloud-native technologies, resulting in successful tool adoption and enhanced operational efficiency.
Kubernetes · Continuous Integration and Continuous Delivery (CI/CD) · Financial Operations · Terraform · AWS · Google Cloud Platform (GCP) · +3

Walmart Global Tech

Senior Software Engineer-IV (DevOps/Platform)

Oct 2020 - Sep 2021 · 11 mos · Bengaluru, Karnataka, India · On-site

  • Led and mentored the Platform Engineering team, promoting adoption of Site Reliability Engineering (SRE) best practices to improve system resilience and operational maturity.
  • Automated infrastructure provisioning and configuration using Terraform, Python, and Ansible, enhancing deployment consistency and reducing manual effort.
  • Designed and implemented CI/CD pipelines with Jenkins and Concord, enabling global, scalable, and secure application deployments.
  • Directed incident response and production troubleshooting efforts, establishing a blameless RCA culture and improving mean time to recovery (MTTR).
  • Built and managed observability frameworks leveraging Datadog, Prometheus, Grafana, and Splunk, ensuring centralized monitoring, log analytics, and proactive alerting.
  • Administered and optimized Splunk clusters, managing indexers, search heads, and ingestion pipelines to improve performance and query efficiency.
Terraform · Python · Ansible · Jenkins · Datadog · Prometheus · +4

Lowe's India

Lead Site Reliability Engineer

Jan 2019 - Jan 2020 · 1 yr · Greater Bengaluru Area · On-site

  • Built and led the SRE team from the ground up, including hiring, onboarding, and mentoring engineers to establish a culture of reliability and operational excellence.
  • Orchestrated and managed microservices on Google Kubernetes Engine (GKE), leading migration of critical workloads from on-premises infrastructure to Google Cloud Platform (GCP).
  • Defined and implemented SLIs, SLOs, SLAs, and Error Budgets, improving observability, accountability, and overall system reliability.
  • Designed and implemented capacity planning and performance engineering frameworks, ensuring scalability and optimal resource utilization.
  • Established comprehensive incident management processes, including blameless RCA governance, runbooks, and monitoring dashboards for faster issue resolution.
  • Introduced and operationalized chaos engineering practices to validate system resilience and proactively address potential failure scenarios.
Google Cloud Platform (GCP) · Chaos Engineering · SRE · Continuous Integration and Continuous Delivery (CI/CD) · Mentoring · Site Reliability Engineering

NextGen Healthcare

Senior Site Reliability Engineer

Aug 2017 - Sep 2019 · 2 yrs 1 mo · Bangalore · On-site

  • Implemented an end-to-end observability stack using Datadog and Sumo Logic, improving incident detection and reducing alert fatigue.
  • Automated provisioning and deployment of infrastructure using Terraform and Ansible, cutting manual operations effort by 50%.
  • Collaborated with development teams to define and monitor SLIs/SLOs, enabling data-driven reliability improvements and faster recovery from failures.
  • Enhanced CI/CD pipelines with Jenkins, integrating automated testing, canary and blue-green deployments, and rollback mechanisms.
  • Participated in on-call rotations and conducted blameless postmortems, identifying recurring issues and driving long-term remediation efforts.
Datadog · Terraform · Ansible · Jenkins · Site Reliability Engineering

SAP

Senior Cloud Support Engineer (SaaS)

Jun 2016 - Aug 2017 · 1 yr 2 mos · Bengaluru Area, India · On-site

  • Automated repetitive operational workflows using Python and Bash, improving system efficiency and reducing manual intervention.
  • Monitored and optimized production environments through Splunk and Pingdom, configuring proactive alerts to ensure high availability and rapid issue detection.
  • Supported SaaS release engineering and data migration initiatives, ensuring smooth deployments and minimal downtime during critical releases.
  • Strengthened change management and incident response processes by improving root cause tracking and enhancing team coordination during major incidents.
Python · Bash · Splunk

BankBazaar.com

Production Support Engineer

Nov 2014 - Jun 2016 · 1 yr 7 mos · Bangalore · On-site

  • Spearheaded the migration of legacy on-prem workloads to AWS, implementing secure Redshift cluster architecture and ensuring seamless data transition.
  • Administered and enhanced Jenkins CI/CD pipelines to streamline release processes and strengthen deployment reliability across environments.
  • Drove production incident management and RCA processes, ensuring SLA adherence and contributing to continuous reliability improvements.
AWS · Jenkins · Linux · SQL · DevOps

Cognizant Technology Solutions

Programmer Analyst

May 2012 - Nov 2014 · 2 yrs 6 mos · Greater Hyderabad Area · On-site

  • Proactively monitored production environments and resolved critical incidents, minimizing downtime and maintaining service continuity.
  • Led root cause investigations and implemented permanent fixes, enhancing system reliability and reducing repeat incidents.
  • Coordinated with product and support teams to deliver customer-facing fixes within SLA, improving operational responsiveness and user experience.

Education

Birla Institute of Technology, Mesra

Bachelor of Engineering (BE) — Electronics & Communication Engineering

Jan 2008 - Jan 2012

University of Michigan

Leading Teams

Jan 2020 - Jan 2020

DAV Public School, CCL, Ranchi

10+2 (AISSCE) — PCM

Jan 2005 - Jan 2007

DAV Public School, NTPC, Kahalgaon

10th (AISSE)

Jan 1999 - Jan 2005
