Aamir Ansari

SRE (Site Reliability Engineer)

Mumbai, Maharashtra, India · 4 yrs 4 mos experience

Most Likely To Switch · Highly Stable

Key Highlights

Expert in Kubernetes and AWS EKS management.
Strong focus on observability and incident response.
Proficient in deploying microservices with Helm.

Stackforce AI infers this person is a SaaS Infrastructure Engineer with strong expertise in cloud-native deployments.

Skills

Core Skills

Kubernetes, AWS, Datadog, Monitoring, Deployment

Other Skills

AWS EKS, Amazon Web Services (AWS), Ansible, Bash, CI, Cassandra, Communication, Computer Science, Containerization, Continuous Delivery (CD), Continuous Integration (CI), Databases, Defining Requirements, Deployment Planning, Design

About

Site Reliability Engineer with 4 years of experience in Kubernetes (AWS EKS), Helm, Terraform, CI/CD, Linux, and cloud-native production systems. Skilled in rolling updates, blue-green deployments, and fixing production-level incidents in distributed environments. Strong in Datadog monitoring, SLO/SLI design, error/frustration dashboards, networking, and automation workflows. Hands-on with Python, Bash, IaC, Consul, microservices debugging, and optimizing reliability, scalability, and performance. Focused on high availability, observability, root-cause analysis, and delivering stable, efficient, and resilient production infrastructure across AWS and on-prem systems.

Experience

4 yrs 4 mos

Total Experience

2 yrs 10 mos

Average Tenure

4 yrs 3 mos

Current Experience

CloudBees

Site Reliability Engineer

Jan 2024 – Present · 2 yrs 5 mos · Remote

I’m part of the Platform Engineering team, where we manage infrastructure using Terraform and Pulumi. I’m responsible for production deployments on our AWS EKS clusters, ensuring reliability and scalability across environments.
Our platform uses Kubernetes (EKS) for orchestration, with automation through GitHub Actions and CloudBees CI. We use Datadog for observability, defining and tracking SLOs/SLIs to maintain platform stability, and PagerDuty for alerting and incident response.
As part of observability initiatives, I collaborate closely with UI teams to identify user frustrations through RUM (Real User Monitoring) sessions and troubleshoot frontend issues. This includes inspecting network activity and APIs via browser developer tools to validate performance and service reliability.
We deploy microservices via Helm charts and Helmfiles, and I’ve contributed to creating and maintaining charts that streamline and standardize our deployment workflows.
We manage Cassandra and PostgreSQL databases, use NATS for messaging, and HashiCorp Vault for secrets management.
I’ve also contributed to service and infrastructure migrations, improving automation, monitoring, and deployment pipelines.

Terraform, Pulumi, AWS EKS, Kubernetes, Datadog, SLOs/SLIs (+6 more)