Sudeep Gupta

SRE (Site Reliability Engineer)

New Delhi, Delhi, India · 13 yrs 6 mos experience
AI ML Practitioner · Highly Stable

Key Highlights

  • Over a decade of experience in SRE and Platform Engineering.
  • Led initiatives to optimize cloud infrastructure costs significantly.
  • Expert in building scalable, reliable cloud-native platforms.
Stackforce AI infers this person is a SaaS and Fintech expert with a strong focus on cloud infrastructure and data engineering.

Skills

Core Skills

Site Reliability Engineering, Platform Engineering, Cloud Infrastructure, Data Engineering

Other Skills

Agentic AI, Airflow, Algorithms, Amazon Web Services (AWS), Ansible, Apache Spark, ArgoCD, Big Data, Business Intelligence, C, C++, Cloud Computing, Data Analysis, Data Analytics, Data Architecture

About

I’m a Staff-level Site Reliability and Platform Engineer with over a decade of experience designing and scaling reliable cloud-native infrastructure and internal platforms that support large-scale distributed systems, analytics, and AI workloads in enterprise environments.

My work sits at the intersection of reliability engineering, platform architecture, and developer productivity - helping engineering teams ship faster, operate safer, and scale systems with confidence while maintaining strong operational and cost discipline.

Over the years, I’ve led initiatives to:

  • Design and scale internal developer platforms built on Kubernetes, Terraform, and GitOps, enabling standardized and high-velocity deployments across large engineering organizations
  • Automate observability, incident response, and reliability workflows, improving system resilience and reducing operational toil in distributed environments
  • Build and operate ML and data infrastructure platforms (Airflow, Databricks, Spark, GCP) supporting training, inference, and large-scale data processing workloads
  • Modernize infrastructure architecture to improve performance, optimize resource utilization, and reduce cloud cost through automation and spot-instance orchestration

I enjoy driving end-to-end platform outcomes - from architecture and automation to cross-team enablement and reliability strategy. My approach to SRE is systems-oriented and I gravitate toward building foundational platforms and frameworks that simplify complexity and create leverage for large engineering teams through design, automation, and scalable abstractions rather than reactive operations.

I’ve worked across B2C, SaaS, AI, and Financial Analytics domains in high-scale, high-stakes environments where platform reliability, developer velocity, and cost efficiency are critical. My work has driven faster deployment cycles, multi-region reliability improvements, and significant infrastructure cost optimizations at enterprise scale.

🔹 Core Skills: Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure (AWS/GCP), Kubernetes, Go, Terraform, GitOps, Observability (Prometheus, Grafana, OpenTelemetry), CI/CD, Distributed Systems, Databricks, Airflow, Spark, Python
🔹 Focus Areas: Platform Reliability • Internal Developer Platforms • ML/AI Infrastructure • Developer Experience • Cost Optimization • Automation Strategy

Experience

Avalara

Lead Site Reliability Engineer

Mar 2024 - Present · 2 yrs · India · Remote

  • Lead the design and development of the internal developer platform and automation tools, while overseeing technical delivery, engineering practices, and scalable solutions that improve developer productivity and operational efficiency across product and engineering teams.
  • Architected and developed a metrics-driven SRE compliance platform (Go, Kafka, GitLab, Prometheus) that replaced manual release governance with continuous policy evaluation, reducing deployment lead time by 80% while improving release stability at scale.
  • Designed and built a Go-based configuration templating engine and validation system for Kubernetes and multi-environment deployments, reducing misconfiguration incidents by 40% and improving deployment hygiene.
  • Designed AI-driven operational tooling integrating Prometheus metrics, logs, and deployment signals to accelerate root-cause analysis for services deployed on the platform.
  • Partnered with Platform, Product, and Engineering Leadership to align infrastructure and reliability initiatives with organizational delivery and uptime goals.
  • Lead a globally distributed team of 4 engineers; increased team velocity by 30% through Agile coaching and continuous feedback loops.
Agentic AI, Team Leadership, Team Management, Strategic Planning, Terraform, Software Project Management, +9 more

Farfetch

2 roles

Senior Infrastructure Engineer

Promoted

Apr 2020 - Mar 2024 · 3 yrs 11 mos

  • Built and scaled centralized cloud infrastructure and internal platforms supporting application, analytics, and MLOps workloads, with a focus on reliability, observability, and cost-efficient infrastructure automation for global engineering teams.
  • Architected and deployed an Airflow Platform-as-a-Service (Terraform, ArgoCD, Helm) with custom RBAC, secrets backend, centralized logging, and observability; led migration from Google Cloud Composer, reducing platform costs by 70% while improving operational control and reliability.
  • Designed and implemented a highly available Prometheus observability stack integrated with PagerDuty and DeadMansSnitch, achieving 99.99% platform uptime and saving $500K annually by retiring Azure Container Insights.
  • Introduced spot-instance orchestration across Kubernetes and Databricks workloads, optimized GPU and compute utilization, and reduced annual infrastructure spend by $100K+.
  • Implemented governance and cost observability frameworks (Databricks Overwatch) to provide automated insights into platform inefficiencies and resource usage patterns.
Python, GitOps, Terraform, Site Reliability Engineering, Problem Solving, Microsoft Azure, +10 more

Infrastructure Engineer

Apr 2020 - Apr 2022 · 2 yrs

  • Worked on developing a data platform for data and MLOps workloads.
GitOps, Site Reliability Engineering, Problem Solving, Go (Programming Language), ArgoCD, Infrastructure, +2 more

BlackRock

Associate

Jul 2017 - Apr 2020 · 2 yrs 9 mos · Gurugram

  • Part of the Financial Modeling Group’s Data Infrastructure initiative, focused on modernizing and automating big data systems for quantitative equity research and analytics.
  • Architected a modular Data Fabric for Equity Research to automate ingestion and storage of multi-source structured and semi-structured datasets, enabling standardized signal generation workflows and reusable compute/analytics layers across research teams (GCP, Python, Flask, MongoDB, Ansible).
  • Led migration of on-prem mortgage asset modeling infrastructure to GCP and Airflow (Composer), modernizing legacy batch pipelines, reducing runtime from 48 hours to 10 hours, and significantly improving research iteration cycles.
  • Designed and implemented a scalable Data Lake platform for low-latency interactive analytics, establishing governance and data organization patterns (Medallion-style layering) to prevent a data swamp and support large-scale analytical workloads (Hadoop, Spark, Presto).
  • Engineered performance-critical internal tooling (FTPSync) for distributed file system synchronization across HDFS, NFS, and object storage, reducing algorithmic complexity from O(n²) to O(n) and lowering memory footprint by 20%.
  • Impact: Accelerated research workflows and enabled scalable data processing for global investment teams.
Data Engineering, Problem Solving, Cloud Computing, DevOps, Kubernetes, Apache Spark, +2 more

Fractal Analytics

2 roles

Senior Data Engineer

Promoted

Oct 2016 - Jul 2017 · 9 mos

  • Worked on large-scale Big Data and Advanced Analytics systems for strategic enterprise and public-sector clients, focusing on distributed data pipelines, performance optimization, and scalable data infrastructure for analytics-driven decision making.
  • Developed distributed financial fraud detection pipelines using Spark, Neo4j, and Python to identify fraud rings and shell entities; optimized pipeline architecture to reduce runtime from 12+ hours to under 2 hours for large graph-based datasets.
  • Designed and implemented Hadoop-based Data Lake and ETL frameworks (Hive, Spark, Ranger, Avro) integrating structured and semi-structured data sources, enabling scalable analytics and warehousing on self-hosted Hortonworks clusters.
  • Engineered production-grade data processing workflows for high-volume analytical workloads, improving reliability, data consistency, and execution efficiency across client environments.
Problem Solving, Infrastructure, Cloud Computing, DevOps

Data Engineer

Jan 2016 - Oct 2016 · 9 mos

Problem Solving, Cloud Computing

Stealth mode start-up

Data Scientist and Technical Lead

Jan 2015 - Jan 2016 · 1 yr · New Delhi Area, India

  • Worked on machine-learning-based business solutions to provide actionable business insights, covering problems such as user behaviour modelling, customer segmentation, and sentiment analysis.
  • Technology Stack: Python, Flask, pandas, scikits, NumPy, MongoDB, Elasticsearch
Problem Solving, Cloud Computing

Innovaccer

SDE

Aug 2014 - Jan 2015 · 5 mos

  • Worked on structured data mining and API development; implemented large-scale distributed crawlers to collect information from various sources.
  • Responsibilities also included direct client interaction, cost and business analysis, and end-to-end project delivery.
  • Technology Stack: Python, Google Compute Engine
Problem Solving, Cloud Computing

IIIT Delhi

2 roles

Research Associate

May 2014 - Aug 2014 · 3 mos

  • Worked on problems in information retrieval, ranking, and social network visualisation.
  • Technology Stack: Python, NetworkX
Problem Solving, Cloud Computing

Graduate Teaching Assistant

Aug 2012 - Apr 2014 · 1 yr 8 mos

  • Responsibilities included substitute lectures, organising tutorials and practical lab sessions, lab viva, evaluation of homework and exam sheets.
Problem Solving

Education

Indraprastha Institute of Information Technology, Delhi

Master's Degree — Computer Science

Jan 2012 - Jan 2014

Guru Gobind Singh Indraprastha University

B.Tech — Computer Science Engineering

Jan 2008 - Jan 2012

Montfort Senior Secondary School, Ashok Vihar, Delhi

Jan 1994 - Jan 2008
