Anant Kumar

SRE (Site Reliability Engineer)

Hyderabad, Telangana, India · 13 yrs 3 mos experience

Highly Stable

Key Highlights

Expert in building and leading SRE teams.
Proven track record in multi-cloud CI/CD pipeline development.
Strong focus on incident management and operational excellence.

Stackforce AI infers this person is a DevOps and Site Reliability Engineering expert in the SaaS industry.

Skills

Core Skills

DevOps, Site Reliability Engineering, Continuous Integration and Continuous Delivery (CI/CD), Infrastructure Automation

Other Skills

AWS, Ansible, Bash, Chaos Engineering, Cloud Computing IaaS, Cloud Development, Continuous Integration (CI), Control-M, Datadog, Docker, ELK, Financial Operations, Git, GitLab, GitHub

About

I am a straightforward thinker, hardworking and sincere in every assignment. I bring a "can do" approach to any task, enjoy challenging work, and always give my best to achieve the goal.

Experience

13 yrs 3 mos

Total Experience

1 yr 9 mos

Average Tenure

1 yr 4 mos

Current Experience

Lloyds Technology Centre India

DevSecOps Engineering (Grade E)

Jan 2025 – Present · 1 yr 4 mos · Hyderabad, Telangana, India · Hybrid

Led and mentored a team of Site Reliability Engineers, fostering skill development, reliability-focused culture, and operational excellence.
Designed, built, and maintained multi-cloud CI/CD pipelines using GitHub Actions for deployments across AWS, GCP, and Azure.
Established robust incident response frameworks, driving root cause analysis (RCA) governance and continuous post-incident improvement processes.
Developed OS-level guardrails and implemented cloud security policies leveraging Wiz, enhancing compliance and security posture across environments.
Automated provisioning and configuration of multi-cloud infrastructure using Terraform, improving scalability and reducing manual effort.
Managed and optimized Kubernetes workloads to ensure high availability, compliance, and cost efficiency across clusters.

GitHub, Terraform, Google Cloud Platform (GCP), Microsoft Azure, AWS, Continuous Integration and Continuous Delivery (CI/CD), +2

Demandbase

Principal Engineer (SRE/DevOps/Platform)

Sep 2021 – Jan 2025 · 3 yrs 4 mos · Hyderabad, Telangana, India · Remote

Led cloud infrastructure strategy and delivery, defined roadmaps, set OKRs, and drove cross-team alignment to build secure, scalable, and automated platforms.
Designed, built, and maintained CI/CD pipelines for multiple microservices across multi-cloud environments (AWS, GCP), ensuring scalability and reliability.
Implemented cloud cost optimization strategies across AWS and GCP, reducing overall infrastructure spend by ~25% through rightsizing, workload scheduling, and monitoring unused resources.
Built an automated AWS and GCP provisioning platform through Terraform IaC and GitLab, reducing setup time by 70% and improving governance.
Designed and operated AWS EKS and GCP GKE Kubernetes clusters, improving deployment speed by 40%.
Implemented a unified observability stack (Prometheus, Grafana, Datadog), reducing incident detection time by 30%.
Established SLIs (Latency, Errors, Traffic, Saturation) and SLOs with Error Budgets across key microservices, driving proactive monitoring and reliability improvements.
Led incident management and postmortem processes for critical production services, reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) by 30% through improved alerting and automation.
Designed and rolled out GitOps pipelines with FluxCD to automate Kubernetes deployments, ensuring version-controlled, auditable, and policy-compliant delivery processes.
Automated key operational tasks using Python and Bash, reducing manual toil by 40% and improving team efficiency and response times.
Drove the adoption of a Shift Left approach by integrating developer productivity and automation tools, improving code quality and accelerating feedback cycles across development teams.
Spearheaded PoC and PoV evaluations of emerging cloud-native technologies, resulting in successful tool adoption and enhanced operational efficiency.

Kubernetes, Continuous Integration and Continuous Delivery (CI/CD), Financial Operations, Terraform, AWS, Google Cloud Platform (GCP), +3

Walmart Global Tech

Senior Software Engineer-IV (DevOps/Platform)

Oct 2020 – Sep 2021 · 11 mos · Bengaluru, Karnataka, India · On-site

Led and mentored the Platform Engineering team, promoting adoption of Site Reliability Engineering (SRE) best practices to improve system resilience and operational maturity.
Automated infrastructure provisioning and configuration using Terraform, Python, and Ansible, enhancing deployment consistency and reducing manual effort.
Designed and implemented CI/CD pipelines with Jenkins and Concord, enabling global, scalable, and secure application deployments.
Directed incident response and production troubleshooting efforts, establishing a blameless RCA culture and improving mean time to recovery (MTTR).
Built and managed observability frameworks leveraging Datadog, Prometheus, Grafana, and Splunk, ensuring centralized monitoring, log analytics, and proactive alerting.
Administered and optimized Splunk clusters, managing indexers, search heads, and ingestion pipelines to improve performance and query efficiency.

Terraform, Python, Ansible, Jenkins, Datadog, Prometheus, +4

Lowe's India

Lead Site Reliability Engineer

Jan 2019 – Jan 2020 · 1 yr · Greater Bengaluru Area · On-site

Built and led the SRE team from the ground up, including hiring, onboarding, and mentoring engineers to establish a culture of reliability and operational excellence.
Orchestrated and managed microservices on Google Kubernetes Engine (GKE), leading migration of critical workloads from on-premise infrastructure to Google Cloud Platform (GCP).
Defined and implemented SLIs, SLOs, SLAs, and Error Budgets, improving observability, accountability, and overall system reliability.
Designed and implemented capacity planning and performance engineering frameworks, ensuring scalability and optimal resource utilization.
Established comprehensive incident management processes, including blameless RCA governance, runbooks, and monitoring dashboards for faster issue resolution.
Introduced and operationalized chaos engineering practices to validate system resilience and proactively address potential failure scenarios.

Google Cloud Platform (GCP), Chaos Engineering, SRE, Continuous Integration and Continuous Delivery (CI/CD), Mentoring, Site Reliability Engineering

NextGen Healthcare

Senior Site Reliability Engineer

Aug 2017 – Sep 2019 · 2 yrs 1 mo · Bangalore · On-site

Implemented an end-to-end observability stack using Datadog and Sumo Logic, improving incident detection and reducing alert fatigue.
Automated provisioning and deployment of infrastructure using Terraform and Ansible, cutting manual operations effort by 50%.
Collaborated with development teams to define and monitor SLIs/SLOs, enabling data-driven reliability improvements and faster recovery from failures.
Enhanced CI/CD pipelines with Jenkins, integrating automated testing, canary and blue-green deployments, and rollback mechanisms.
Participated in on-call rotations and conducted blameless postmortems, identifying recurring issues and driving long-term remediation efforts.

Datadog, Terraform, Ansible, Jenkins, Site Reliability Engineering

SAP

Senior Cloud Support Engineer (SaaS)

Jun 2016 – Aug 2017 · 1 yr 2 mos · Bengaluru Area, India · On-site

Automated repetitive operational workflows using Python and Bash, improving system efficiency and reducing manual intervention.
Monitored and optimized production environments through Splunk and Pingdom, configuring proactive alerts to ensure high availability and rapid issue detection.
Supported SaaS release engineering and data migration initiatives, ensuring smooth deployments and minimal downtime during critical releases.
Strengthened change management and incident response processes by improving root cause tracking and enhancing team coordination during major incidents.

Python, Bash, Splunk

BankBazaar.com

Production Support Engineer

Nov 2014 – Jun 2016 · 1 yr 7 mos · Bangalore · On-site

Spearheaded the migration of legacy on-prem workloads to AWS, implementing secure Redshift cluster architecture and ensuring seamless data transition.
Administered and enhanced Jenkins CI/CD pipelines to streamline release processes and strengthen deployment reliability across environments.
Drove production incident management and RCA processes, ensuring SLA adherence and contributing to continuous reliability improvements.

AWS, Jenkins, Linux, SQL, DevOps

Cognizant Technology Solutions

Programmer Analyst

May 2012 – Nov 2014 · 2 yrs 6 mos · Greater Hyderabad Area · On-site

Proactively monitored production environments and resolved critical incidents, minimizing downtime and maintaining service continuity.
Led root cause investigations and implemented permanent fixes, enhancing system reliability and reducing repeat incidents.
Coordinated with product and support teams to deliver customer-facing fixes within SLA, improving operational responsiveness and user experience.