Nitin Gupta

CTO

Sammamish, Washington, United States
15 yrs 11 mos experience
Most Likely To Switch: Highly Stable

Key Highlights

  • Led engineering teams at Meta for 10+ years.
  • Scaled systems 100x while improving data freshness.
  • Reduced investigation time for critical alerts by 50%.
Stackforce AI infers this person is a highly experienced leader in AI Infrastructure and Distributed Systems.

Skills

Core Skills

AI Infra, Distributed Systems, ML Systems, Infrastructure, Data Engineering, Software Development, Framework Development

Other Skills

LLM inference, AI deployment, scalable systems, Stream Processing, real-time data, data freshness, Monitoring, Observability, investigation automation, real-time analytics, billing insights, reliability improvement, cross-platform framework, migration, future-proofing

About

I build the infrastructure that makes frontier AI fast, reliable, and scalable. Over the past 10+ years at Meta, I've led engineering teams across some of the most demanding infrastructure domains in the industry, from LLM inference for frontier models and Stream Processing at petabyte scale to the Observability systems that keep Meta's products running for billions of users.

Today, I'm part of Meta Superintelligence Labs, where my team owns the LLM inference platform that delivers frontier models like MUSE Spark. We optimize for latency, throughput, and cost efficiency across both online serving and multimodal training workloads, the kind of systems where a 10ms improvement matters at millions of QPS.

Before this, I led Meta's stream processing platform (XStream), where we powered real-time data infra for AI/ML, directly influencing content ranking for Instagram Reels and Facebook Feed. We scaled systems 100x while improving data freshness to sub-second latency.

Earlier, I built Meta's end-to-end Monitoring & Observability infrastructure and delivered AI-powered investigations, before the LLM buzz and while the industry was still figuring out what that would look like, reducing investigation time for critical alerts at Meta by 50%.

Before Meta: at AWS, I led the EBS real-time analytics platform, taking reliability from 98% to 99.99% as EBS grew into a multi-billion dollar product. At Adobe, I developed the cross-platform framework that moved Photoshop, Illustrator, and Dreamweaver from Carbon to Cocoa.

Experience

15 yrs 11 mos
Total Experience
4 yrs
Average Tenure
10 yrs 9 mos
Current Experience

Meta

3 roles

Engineering Leader

Promoted

Mar 2025 - Present · 1 yr 3 mos · On-site

  • Meta Superintelligence Labs: building the LLM inference platform for frontier models, optimized for scalable, efficient, and high-performance AI deployment across production (online serving) and batch (multimodal training) workloads.
  • Previously built infrastructure to accelerate research velocity, enabling faster LLM training, experimentation, and iteration for frontier-scale models.
LLM inference, AI deployment, scalable systems, AI Infra, Distributed Systems

Senior Engineering Manager

Apr 2023 - Mar 2025 · 1 yr 11 mos · On-site

  • The Stream Processing team builds and manages Xstream, a managed stream processing service for Meta scale. We support mission-critical workloads ranging from infrastructure services to real-time feature and training data generation for the AI/ML domain.
  • My teams are working on scaling our systems 100x and improving data-processing freshness to seconds, which improves the quality of content users see in our apps, such as IG Reels and Facebook Feed.
Stream Processing, real-time data, data freshness, ML systems, Distributed Systems

Engineering Manager

Sep 2015 - Apr 2023 · 7 yrs 7 mos · On-site

  • The Monitoring team is responsible for building the infrastructure for end-to-end reliability of products & services at Meta. This is one of the most critical pieces of infrastructure at Meta and is used by everyone at Meta to detect & investigate issues in production systems.
  • My teams are responsible for developing products & services that help developers at Facebook quickly investigate and remediate issues in production systems, reducing the impact of outages at Meta. We do so by building new products that simplify & automate investigations.
  • Some of the work from my teams:
  • Systems@Scale conference: https://www.youtube.com/watch?v=LUAbZYp8e6o
  • LISA 2019 conference: https://www.youtube.com/watch?v=nQg1jJNpAi4&t=2062s
  • Root Cause Analysis at Scale: https://engineering.fb.com/developer-tools/fast-dimensional-analysis/
Monitoring, Observability, investigation automation, AI Infra, Distributed Systems

Amazon Web Services (AWS)

Software Development Engineer

Feb 2012 - Aug 2015 · 3 yrs 6 mos · Greater Seattle Area

  • Led EBS real-time analytics platform for billing and volume insights, improving reliability from 98% to 99.99% as EBS grew into a multi-billion dollar revenue product.
  • Previously modernized the AWS commerce platform workflow engine for metering and billing to support AWS's rapid growth during 2012-2014, and migrated critical billing and customer data with zero downtime.
real-time analytics, billing insights, reliability improvement, Infrastructure, Data Engineering

Adobe

Member Technical Staff

May 2010 - Jan 2012 · 1 yr 8 mos · Noida

  • Lead developer for the OS-agnostic framework (Drover) from the Adobe Illustrator team, enabling smooth migration of Adobe products (including Photoshop & Dreamweaver) from the Mac OS Carbon framework to Cocoa, as well as future-proofing Adobe products against backward-incompatible OS updates.
  • Received several awards including “Special Contribution Award” and “Best Debutant Developer” award
cross-platform framework, migration, future-proofing, Software Development, Framework Development

Google Summer of Code

Open Source Developer

May 2010 - Aug 2010 · 3 mos · Remote

  • Built a Facebook-like Micropublisher (https://publicmind.in/blog/fbsmp_preview/) that allowed thousands of developers worldwide to build their own social network websites. Worked with CEOs to launch their social networking sites.

L3S Research Center

Research Intern

May 2009 - Jul 2009 · 2 mos · Hannover, Germany

  • Carried out background research and contributed to the conceptualization, design and implementation of a prototype system for the project "Web history tools for future browsers".

Srijan Technologies Pvt. Ltd.

Open Source Developer

Dec 2008 - Dec 2008 · 0 mo · New Delhi, India

  • Developed a module for the Drupal Content Management System, used by thousands of websites.
  • http://drupal.org/project/feedapi_imagegrabber
  • http://drupal.org/project/feeds_imagegrabber

Education

Indian Institute of Technology, Guwahati

B. Tech — Computer Science and Engineering

Jan 2006 - Jan 2010
