Naveen Cherukuri

CTO

Sunnyvale, California, United States · 14 yrs 11 mos experience
Most Likely To Switch · AI ML Practitioner

Key Highlights

  • Led AI training infrastructure for Meta's GPU fleet
  • Developed open-source data processing engines at scale
  • Managed real-time analytics for Uber's critical operations
Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer specializing in AI and Data Processing.

Skills

Core Skills

AI Infrastructure · Distributed Systems · Resource Management · Data Processing · AI/ML Infrastructure · Analytics · Data Infrastructure · Data Quality

Other Skills

C++ · AI Training · Orchestration · Job Scheduling · Apache Spark · Data Analytics · AI/ML Processing · GenAI Workloads · Presto · Distributed SQL · Hadoop · Data Analysis · Machine Learning · Big Data · Data Science

About

Currently in Meta Super Intelligence Labs, building the state-of-the-art scheduler, resource manager, and orchestrator for Meta's AI training and inference workloads, handling one of the largest GPU fleets in the industry for Ads, Facebook, IG, and Llama.

Engineering leader with 13+ yrs of experience in data infra and AI infra, building and running the open-source Presto, Spark, and Velox engines at scale at Meta, Uber, etc.

MAST (https://www.usenix.org/system/files/osdi24-choudhury.pdf) is Meta's AI training job scheduler. We build and run services for scheduling distributed ML training and inference jobs on Meta's internal GPU compute infrastructure. We are at the heart of Meta's investment in AI. Every day, MAST schedules and starts hundreds of thousands of AI training jobs for all of our product groups, and we will scale further. We deal with challenges such as gang scheduling, matching jobs with varying requirements to heterogeneous hardware, fairness, and constraint solving around data locality, regional capacity, and job priorities. As an essential piece of Meta infrastructure, we have to ensure service reliability, high-SLA availability, and service efficiency. Most of our code is in C++.

Previously:

1. Supported large-scale analytics and AI/ML processing at Meta scale. The Spark team runs mission-critical workloads across the data analytics and AI/ML businesses and is rapidly expanding and evolving to support GenAI workloads on Meta's data lake. This team is responsible for designing, building, and supporting one of the largest data processing systems on the planet.

2. Meta's open-source Presto (https://prestodb.io/), covering the interactive (RaptorX) and low-latency analytics space. Presto is a distributed SQL query engine for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes.

3. Meta's open-source Velox (https://github.com/facebookincubator/velox). Velox is a C++ database acceleration library that provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and is currently developed in partnership with Intel, ByteDance, and Ahana.

Experience

Total Experience: 14 yrs 11 mos
Average Tenure: 2 yrs 11 mos
Current Experience: 6 yrs

Meta

Engineering leader

May 2020 - Present · 6 yrs · Menlo Park, California, United States

  • Supporting MAST. MAST (https://www.usenix.org/system/files/osdi24-choudhury.pdf) is Meta's AI training job scheduler. We build and run services for scheduling distributed ML training and inference jobs on Meta's internal GPU compute infrastructure. We are at the heart of Meta's investment in AI.
  • Every day, MAST schedules and starts hundreds of thousands of AI training jobs for all of our product groups, and we will scale further.
  • We deal with challenges such as gang scheduling, matching jobs with varying requirements to heterogeneous hardware, fairness, and constraint solving around data locality, regional capacity, and job priorities.
  • As an essential piece of Meta infrastructure, we have to ensure service reliability, high-SLA availability, and service efficiency. Most of our code is in C++.
  • Previously:
  • 1. Supported large-scale analytics and AI/ML processing at Meta scale. The Spark team runs mission-critical workloads across the data analytics and AI/ML businesses and is rapidly expanding and evolving to support GenAI workloads on Meta's data lake. This team is responsible for designing, building, and supporting one of the largest data processing systems on the planet.
  • 2. Meta's open-source Presto (https://prestodb.io/), covering the interactive (RaptorX) and low-latency analytics space. Presto is a distributed SQL query engine for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes.
  • 3. Meta's open-source Velox (https://github.com/facebookincubator/velox). Velox is a C++ database acceleration library that provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and is currently developed in partnership with Intel, ByteDance, and Ahana.
C++ · AI Training · Resource Management · Orchestration · Distributed Systems · AI Infrastructure

Uber

Engineering Manager

Oct 2015 - Apr 2020 · 4 yrs 6 mos · San Francisco Bay Area

  • Real Time Infrastructure and Analytics
  • Manage the Real-time Stream Processing/Analytics/Data Infra team at Uber. My team owns the platform that runs the critical pipelines powering Uber's real-time business, as well as analytical ones, with high reliability and at a scale of trillions of messages processed per day. Uber Surge, Eats dashboarding and visualization, machine learning, rider experience, and driver notifications are all powered by our platform. We nurture Apache Flink and have contributed back, especially to the SQL APIs.
  • Lead on the Real-time Streaming Platform team, responsible for integrating Kafka into the Uber ecosystem, scaling it to trillions of messages a day, and running it reliably to power Uber's real-time business.
  • Built AthenaX to power SQL on streams, initially on Apache Samza and later on Apache Flink.
Apache Spark · Data Analytics · AI/ML Processing · GenAI Workloads · Data Processing · AI/ML Infrastructure

Salesforce

Senior Member of Technical Staff

Dec 2012 - Oct 2015 · 2 yrs 10 mos · San Francisco Bay Area

  • I work on the Data Platform and Infrastructure R&D team, responsible for building large-scale systems for data quality and data analysis using Hadoop, HBase, and other technologies in the Hadoop ecosystem. Working on migrating critical data processing flows to newer frameworks such as Storm, Kafka, and SolrCloud. Work closely with data scientists on developing, analyzing, and productizing machine learning models. Developed API frameworks for exposing cloud-based data cleansing solutions.
Presto · Distributed SQL · Analytics · Data Processing

Oracle

Member Technical Staff

Aug 2011 - Dec 2012 · 1 yr 4 mos · California

  • I worked on the Oracle Private Cloud Management team on integrated Oracle cloud stack solutions, including Infrastructure, Database, and Middleware as a service.
Hadoop · Data Quality · Data Analysis · Machine Learning · Data Infrastructure

Microsoft

Software Developer Intern

May 2010 - Aug 2010 · 3 mos

  • Worked on the Indirect Wi-Fi display feature built into Windows 8.

Education

University of Illinois Urbana-Champaign

Masters — Computer Science

Jan 2009 - Jan 2011

Indian Institute of Technology, Guwahati

Bachelor's degree — Computer Science and Engineering

Jan 2005 - Jan 2009
