Sudeep Gupta — SRE (Site Reliability Engineer)

I’m a Staff-level Site Reliability and Platform Engineer with over a decade of experience designing and scaling reliable cloud-native infrastructure and internal platforms that support large-scale distributed systems, analytics, and AI workloads in enterprise environments.

My work sits at the intersection of reliability engineering, platform architecture, and developer productivity - helping engineering teams ship faster, operate safer, and scale systems with confidence while maintaining strong operational and cost discipline.

Over the years, I’ve led initiatives to:
- Design and scale internal developer platforms built on Kubernetes, Terraform, and GitOps, enabling standardized and high-velocity deployments across large engineering organizations
- Automate observability, incident response, and reliability workflows, improving system resilience and reducing operational toil in distributed environments
- Build and operate ML and data infrastructure platforms (Airflow, Databricks, Spark, GCP) supporting training, inference, and large-scale data processing workloads
- Modernize infrastructure architecture to improve performance, optimize resource utilization, and reduce cloud cost through automation and spot-instance orchestration

I enjoy driving end-to-end platform outcomes - from architecture and automation to cross-team enablement and reliability strategy. My approach to SRE is systems-oriented and I gravitate toward building foundational platforms and frameworks that simplify complexity and create leverage for large engineering teams through design, automation, and scalable abstractions rather than reactive operations. I’ve worked across B2C, SaaS, AI, and Financial Analytics domains in high-scale, high-stakes environments where platform reliability, developer velocity, and cost efficiency are critical. My work has driven faster deployment cycles, multi-region reliability improvements, and significant infrastructure cost optimizations at enterprise scale.

🔹 Core Skills: Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure (AWS/GCP), Kubernetes, Go, Terraform, GitOps, Observability (Prometheus, Grafana, OpenTelemetry), CI/CD, Distributed Systems, Databricks, Airflow, Spark, Python
🔹 Focus Areas: Platform Reliability • Internal Developer Platforms • ML/AI Infrastructure • Developer Experience • Cost Optimization • Automation Strategy

Stackforce AI infers this person is a SaaS and Fintech expert with a strong focus on cloud infrastructure and data engineering.

Location: New Delhi, Delhi, India

Experience: 13 yrs 6 mos

Skills

Site Reliability Engineering
Platform Engineering
Cloud Infrastructure
Data Engineering

Career Highlights

Over a decade of experience in SRE and Platform Engineering.
Led initiatives to optimize cloud infrastructure costs significantly.
Expert in building scalable, reliable cloud-native platforms.

Work Experience

Avalara

Lead Site Reliability Engineer (2 yrs)

FARFETCH

Senior Infrastructure Engineer (3 yrs 11 mos)

Infrastructure Engineer (2 yrs)

BlackRock

Associate (2 yrs 9 mos)

Fractal Analytics

Senior Data Engineer (9 mos)

Data Engineer (9 mos)

Stealth mode start-up

Data Scientist and Technical Lead (1 yr)

Innovaccer

SDE (5 mos)

IIIT Delhi

Research Associate (3 mos)

Graduate Teaching Assistant (1 yr 8 mos)

Education

Master's Degree at Indraprastha Institute of Information Technology, Delhi

B.Tech at Guru Gobind Singh Indraprastha University

Montfort Senior Secondary School, Ashok Vihar, Delhi

Sudeep Gupta

SRE (Site Reliability Engineer)

New Delhi, Delhi, India · 13 yrs 6 mos experience

AI ML Practitioner · Highly Stable

Contact

sudeepgupta90@gmail.com LinkedIn

Skills

Core Skills

Site Reliability Engineering
Platform Engineering
Cloud Infrastructure
Data Engineering

Other Skills

Agentic AI
Airflow
Algorithms
Amazon Web Services (AWS)
Ansible
Apache Spark
ArgoCD
Big Data
Business Intelligence
C
C++
Cloud Computing
Data Analysis
Data Analytics
Data Architecture

Experience

Avalara

Lead Site Reliability Engineer

Mar 2024 – Present · 2 yrs · India · Remote

Lead the design and development of an internal developer platform and automation tools, while overseeing technical delivery, engineering practices, and scalable solutions that improve developer productivity and operational efficiency across product/engineering teams.
- Architected and developed a metrics-driven SRE compliance platform (Go, Kafka, GitLab, Prometheus) that replaced manual release governance with continuous policy evaluation, reducing deployment lead time by 80% while improving release stability at scale
- Designed and built a Go-based configuration templating engine and validation system for Kubernetes and multi-environment deployments, reducing misconfiguration incidents by 40% and improving deployment hygiene
- Designed AI-driven operational tooling integrating Prometheus metrics, logs, and deployment signals to accelerate root-cause analysis for services deployed on the platform
- Partnered with Platform, Product, and Engineering Leadership to align infrastructure and reliability initiatives with organizational delivery and uptime goals
- Lead a globally distributed team of 4 engineers; increased team velocity by 30% through Agile coaching and continuous feedback loops

Agentic AI · Team Leadership · Team Management · Strategic Planning · Terraform · Software Project Management

Farfetch

2 roles

Senior Infrastructure Engineer

Promoted

Apr 2020 – Mar 2024 · 3 yrs 11 mos

Built and scaled centralized cloud infrastructure and internal platforms supporting application, analytics, and MLOps workloads, with a focus on reliability, observability, and cost-efficient infrastructure automation for global engineering teams.
- Architected and deployed an Airflow Platform-as-a-Service (Terraform, ArgoCD, Helm) with custom RBAC, secrets backend, centralized logging, and observability; led migration from Google Cloud Composer, reducing platform costs by 70% while improving operational control and reliability
- Designed and implemented a highly available Prometheus observability stack integrated with PagerDuty and DeadMansSnitch, achieving 99.99% platform uptime and saving $500K annually by retiring Azure Container Insights
- Introduced spot-instance orchestration across Kubernetes and Databricks workloads, optimized GPU and compute utilization, and reduced annual infrastructure spend by $100K+
- Implemented governance and cost observability frameworks (Databricks Overwatch) to provide automated insights into platform inefficiencies and resource usage patterns

Python · GitOps · Terraform · Site Reliability Engineering · Problem Solving · Microsoft Azure

Infrastructure Engineer

Apr 2020 – Apr 2022 · 2 yrs

Worked on developing a Data Platform for Data and MLOps workloads.

GitOps · Site Reliability Engineering · Problem Solving · Go (Programming Language) · ArgoCD · Infrastructure

BlackRock

Associate

Jul 2017 – Apr 2020 · 2 yrs 9 mos · Gurugram

Part of the Financial Modeling Group’s Data Infrastructure initiative, focused on modernizing and automating big data systems for quantitative equity research and analytics.
- Architected a modular Data Fabric for Equity Research to automate ingestion and storage of multi-source structured and semi-structured datasets, enabling standardized signal generation workflows and reusable compute/analytics layers across research teams (GCP, Python, Flask, MongoDB, Ansible)
- Led migration of on-prem mortgage asset modeling infrastructure to GCP and Airflow (Composer), modernizing legacy batch pipelines, reducing runtime from 48 hours to 10 hours, and significantly improving research iteration cycles
- Designed and implemented a scalable Data Lake platform for low-latency interactive analytics, establishing governance and data organization patterns (Medallion-style layering) to prevent a data swamp and support large-scale analytical workloads (Hadoop, Spark, Presto)
- Engineered performance-critical internal tooling (FTPSync) for distributed file system synchronization across HDFS, NFS, and object storage, reducing algorithmic complexity from O(n²) to O(n) and lowering memory footprint by 20%

Impact: Accelerated research workflows and enabled scalable data processing for global investment teams.

Data Engineering · Problem Solving · Cloud Computing · DevOps · Kubernetes · Apache Spark

Fractal Analytics

2 roles

Senior Data Engineer

Promoted

Oct 2016 – Jul 2017 · 9 mos

Worked on large-scale Big Data and Advanced Analytics systems for strategic enterprise and public-sector clients, focusing on distributed data pipelines, performance optimization, and scalable data infrastructure for analytics-driven decision making.
- Developed distributed financial fraud detection pipelines using Spark, Neo4j, and Python to identify fraud rings and shell entities; optimized pipeline architecture to reduce runtime from 12+ hours to under 2 hours for large graph-based datasets
- Designed and implemented Hadoop-based Data Lake and ETL frameworks (Hive, Spark, Ranger, Avro) integrating structured and semi-structured data sources, enabling scalable analytics and warehousing on self-hosted Hortonworks clusters
- Engineered production-grade data processing workflows for high-volume analytical workloads, improving reliability, data consistency, and execution efficiency across client environments

Problem Solving · Infrastructure · Cloud Computing · DevOps

Data Engineer

Jan 2016 – Oct 2016 · 9 mos

Problem Solving · Cloud Computing

Stealth mode start-up

Data Scientist and Technical Lead

Jan 2015 – Jan 2016 · 1 yr · New Delhi Area, India

Worked on machine-learning-based business solutions that provide actionable business insights, covering problems such as user behaviour modelling, customer segmentation, and sentiment analysis.
Technology Stack: Python, Flask, pandas, scikits, NumPy, MongoDB, Elasticsearch

Problem Solving · Cloud Computing

Innovaccer

SDE

Aug 2014 – Jan 2015 · 5 mos

Worked on structured data mining and API development, and implemented large-scale distributed crawlers to collect information from various sources.
Responsibilities also included direct client interaction, cost and business analysis, and end-to-end project delivery.
Technology Stack: Python, Google Compute Engine

Problem Solving · Cloud Computing