Mahendra Singh

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 12 yrs 9 mos experience

Most Likely To Switch · Highly Stable

Key Highlights

12+ years of experience in SRE and DevOps.
Expert in designing scalable, cloud-native platforms.
Proven track record in optimizing system performance.

Stackforce AI infers this person is a Cloud Infrastructure Engineer with expertise in DevOps and Site Reliability Engineering.

Skills

Core Skills

Site Reliability Engineering, Infrastructure As Code, Cloud Management, Cloud Platform Management, Cloud Operations, Infrastructure Management, System Administration

Other Skills

AWS, AWS CloudFormation, Agile, Agile Methodologies, Alerting, Amazon S3, Amazon VPC, Analytical Skills, Ansible, Application Security, Application Support, Backup & Recovery Systems, Capacity Planning, Cloud Automation, CloudFormation

About

A seasoned SRE | DevOps | Infrastructure Engineering Leader with 12+ years of experience building and mentoring high-performing global teams. I specialize in designing scalable, cloud-native platforms (AWS), implementing Infrastructure as Code (Terraform, CloudFormation), and leading automation strategies that drive operational excellence and reliability. Proven track record in delivering end-to-end infrastructure solutions, optimizing system performance, overseeing deployment and infrastructure configuration, and aligning technical execution with business goals. Skilled in Agile delivery, cross-functional collaboration, stakeholder management, and setting engineering OKRs that deliver measurable impact.

Experience

Total Experience: 12 yrs 9 mos

Average Tenure: 2 yrs 6 mos

Current Experience: 7 yrs 2 mos

Atlassian

SRE/DevOps Engineering Leader

Apr 2019 – Present · 7 yrs 2 mos · Bengaluru, Karnataka, India · Remote

Led and mentored globally distributed teams, driving reliability, scalability, and performance of business-critical services hosted on AWS.
Architected and managed multi-account AWS infrastructure using Terraform, Ansible, and CloudFormation, ensuring 99.99% uptime and delivering highly available, fault-tolerant systems.
Led Infrastructure-as-Code (IaC) automation using Terraform, streamlining infrastructure provisioning and implementing GitOps workflows with Bitbucket Pipelines, cutting deployment time, eliminating manual processes, and reducing configuration drift.
Built a unified observability and alerting framework with CloudWatch, SignalFx, and LogicMonitor, enhancing incident detection and reducing mean time to resolution (MTTR).
Drove AWS cost optimization strategy, reducing annual cloud spend through re-architecting, reserved instance planning, unused resource clean-up, autoscaling, and storage lifecycle policies.
Implemented disaster recovery strategies for critical services using AWS multi-region and failover architectures, achieving service-tier-based RTO and RPO targets.
Led 24/7 incident response operations, including on-call rotations and escalation workflows, ensuring SLA adherence and service continuity for critical automation and reporting platform services.
Collaborated cross-functionally with product, security, and application teams to align infrastructure initiatives with compliance, release management, and business continuity objectives.
Implemented Agile engineering practices to improve delivery predictability.
Developed high-performing teams through coaching, structured career planning, and feedback cycles, maintaining team retention and fostering multiple internal promotions.

AWS, Terraform, Ansible, CloudFormation, CloudWatch, SignalFx (+5 more)

Altisource

Tech Lead Cloud Platform

Nov 2016 – Jan 2019 · 2 yrs 2 mos · Bangalore · Hybrid

Led 24/7 reliability operations for multiple services across AWS and Verizon Cloud, achieving and sustaining 99.99% availability for critical business platforms.
Built and led a high-impact team of engineers; implemented a streamlined onboarding and training framework, accelerating time-to-productivity.
Delivered infrastructure automation solutions from scratch by establishing Infrastructure-as-Code (IaC) standards with Terraform and DevOps tooling, which reduced provisioning and deployment time while improving platform agility.
Scaled monitoring infrastructure (Zabbix) for 1,00+ endpoints; reduced alert noise, improving the signal-to-noise ratio for faster issue triage.
Resolved incidents within SLA by building a culture of operational excellence, standardizing RCA processes, and deploying automated recovery playbooks.
Oversaw security compliance through automated quarterly patching of cloud servers, maintaining 100% compliance across regulated environments.

AWS, Terraform, DevOps, Zabbix, Incident Management, Security Compliance (+2 more)

Freecharge

Cloud Operations Engineer

Oct 2015 – Nov 2016 · 1 yr 1 mo · Bangalore

Managed infrastructure operations for 300+ servers, ensuring high availability.
Monitored system metrics (CPU, Memory, Disk) proactively, executing remediation to prevent downtime.
Troubleshot and resolved infrastructure incidents; led root cause analysis (RCA) to improve system reliability.

Infrastructure Operations, System Monitoring, Troubleshooting, Cloud Operations

CSS Corp

System Engineer

Jul 2014 – Oct 2015 · 1 yr 3 mos · Chennai, Tamil Nadu, India

Oversaw 24/7 operations of large-scale AT&T infrastructure, leveraging Nagios-based monitoring and alerting to ensure high availability, system reliability, and rapid issue detection across distributed environments.

Monitoring, Alerting, System Reliability, Infrastructure Management

Radical Technologies Pvt. Ltd.

Linux Administrator

Jun 2013 – Jul 2014 · 1 yr 1 mo · Pune, Maharashtra, India

Performed Linux (RHEL) system management, including package installation, upgrades, and configuration; delivered L1 troubleshooting and issue resolution to ensure system stability and performance.

Linux System Management, Troubleshooting, System Administration