Mohd Iqubal

SRE (Site Reliability Engineer)

New Delhi, Delhi, India · 10 yrs 5 mos experience

Highly Stable

Key Highlights

Reduced incident detection time from 19 to 3 minutes.
Built SRE and observability foundation from scratch.
Operated one of the largest Kubernetes platforms globally.

Stackforce AI infers this person is a SaaS Infrastructure Engineer with extensive experience in Site Reliability Engineering.

Skills

Core Skills

Site Reliability Engineering, Observability & Monitoring, Automation, Platform Engineering, Incident Management

Other Skills

Kubernetes, AWS, Python, Prometheus, Grafana, OpenTelemetry, CI/CD, Observability, Datadog, Splunk, Problem Management, Root Cause Analysis, Infrastructure as Code (IaC), Terraform, Continuous Integration and Continuous Delivery (CI/CD)

About

Some engineers write the runbook. I have been the one on call when the runbook runs out. I have spent 13+ years keeping production systems alive, moving from traditional ops to SRE to platform engineering at genuine scale. My longest chapter was Expedia Group, ~8 years, where I worked on a Kubernetes platform running 600+ clusters, 8,500 nodes, and 16,000 pods, one of the largest of its kind globally. In that time I reduced incident detection time from 19 minutes to 3 minutes and cut recovery time from 327 to 167 minutes. I also helped track down a silent checkout failure that was costing real bookings every minute; fixing it improved conversion by 8%. Right now I'm building a full SRE and observability foundation from scratch for a SaaS platform: logging, metrics, and distributed tracing using Elastic Stack, Prometheus, Grafana, and OpenTelemetry. Starting from zero and making it production-ready is honestly one of the most satisfying things you can do in this field. My strength is knowing how systems actually fail and building the people, processes, and tooling around that reality.

Experience

Total Experience: 10 yrs 5 mos

Average Tenure: 3 yrs 5 mos

Current Experience

Devo

Senior Site Reliability Engineer

Nov 2024 – Oct 2025 · 11 mos

Managed multi-tenant Kubernetes workloads on AWS, maintaining 99.99% SLA compliance across customer-facing SaaS services.
Automated operational workflows using Python, reducing manual ticket volume by ~40% and improving response consistency.
Enhanced observability using Prometheus, Grafana, and OpenTelemetry, reducing alert noise and improving incident triage time.
Identified and removed unused EBS volumes across shared AWS accounts, improving cloud cost efficiency and governance visibility.

Kubernetes, AWS, Python, Prometheus, Grafana, OpenTelemetry (+2 more)

Expedia Group

Infrastructure & Enterprise Engineer III | Senior Reliability Engineer

Feb 2017 – Oct 2024 · 7 yrs 8 mos · Gurugram, Haryana, India

Operated federated Kubernetes Runtime Compute Platform (600+ clusters, 8,500 nodes, 160k pods) on AWS supporting thousands of internal developer workloads globally.
Reduced MTTD from 19 → 3 minutes and MTTR by ~50% by redesigning alerting strategy, tuning SLIs/SLOs, and automating incident workflows.
Led Sev-1/Sev-2 production incidents, coordinating cross-functional platform and service teams to restore stability under high-traffic conditions.
Migrated services from Mesos/Nomad to Kubernetes RCP, improving deployment consistency and platform resilience.
Improved CI/CD reliability (Spinnaker, Jenkins, GitHub Actions), increasing deployment stability and reducing rollback frequency.
Integrated observability stack (Grafana, Datadog, Splunk) and platform security controls (Vault, OPA, Teleport) across multi-cluster environments.

Kubernetes, AWS, Incident Management, CI/CD, Observability, Datadog (+3 more)

HCL Technologies

Software Engineer | SME Problem Management

Apr 2012 – Feb 2014 · 1 yr 10 mos · Noida, Uttar Pradesh, India · On-site

Led 24x7 production reliability for high-volume global e-commerce and transaction platforms, ensuring SLA adherence across distributed environments.
Drove structured RCA processes, identifying systemic failure patterns and implementing preventive fixes to reduce recurring production incidents.
Collaborated with global engineering teams to stabilize critical applications under peak traffic conditions.
Recognized with multiple HCL Livewire Awards for reliability excellence and incident resolution leadership.

Problem Management, Root Cause Analysis, Incident Management