Naveen Cherukuri — CTO

Currently at Meta Superintelligence Labs, building the state-of-the-art scheduler/resource manager/orchestrator for AI training and inference workloads at Meta, handling one of the largest GPU fleets in the industry for Ads, Facebook, IG, and Llama.

Engineering leader with 13+ years of experience in data infrastructure and AI infrastructure, building and running the open-source Presto, Spark, and Velox engines at scale at Meta, Uber, etc.

MAST (https://www.usenix.org/system/files/osdi24-choudhury.pdf) is Meta's AI training job scheduler. We build and run services for scheduling distributed ML training and inference jobs on Meta's internal GPU compute infrastructure. We are at the heart of Meta's investment in AI. Every day, MAST schedules and starts hundreds of thousands of AI training jobs for all of our product groups, and we will scale further. We deal with challenges such as gang scheduling, matching jobs with varying requirements to heterogeneous hardware, fairness, and constraint solving around data locality, regional capacity, and job priorities. As an essential piece of Meta infrastructure, we have to ensure service reliability, high SLA availability, and service efficiency. Most of our code is in C++.

Previously:
1. Supported large-scale analytics and AI/ML processing at Meta scale. The Spark team runs mission-critical workloads across data analytics and AI/ML businesses and is expanding and evolving rapidly to support GenAI workloads on Meta's data lake. This team is responsible for designing, building, and supporting one of the largest data processing systems on the planet.
2. Meta's open-source Presto, in the interactive (RaptorX) and low-latency analytics space (https://prestodb.io/). Presto is a distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
3. Meta's open-source Velox (https://github.com/facebookincubator/velox). Velox is a C++ database acceleration library that provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and is currently developed in partnership with Intel, ByteDance, and Ahana.

Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer specializing in AI and Data Processing.

Location: Sunnyvale, California, United States

Experience: 14 yrs 11 mos

Skills

AI Infrastructure
Distributed Systems
Resource Management
Data Processing
AI/ML Infrastructure
Analytics
Data Infrastructure
Data Quality

Career Highlights

Led AI training infrastructure for Meta's GPU fleet
Developed open-source data processing engines at scale
Managed real-time analytics for Uber's critical operations

Work Experience

Uber

Engineering Manager (4 yrs 6 mos)

Salesforce

Senior Member of Technical Staff (2 yrs 10 mos)

Oracle

Member Technical Staff (1 yr 4 mos)

Microsoft

Software Developer Intern (3 mos)

Education

Master's degree at University of Illinois Urbana-Champaign

Bachelor's degree at Indian Institute of Technology, Guwahati

Most Likely To Switch · AI ML Practitioner

Other Skills

C++
AI Training
Orchestration
Job Scheduling
Apache Spark
Data Analytics
AI/ML Processing
GenAI Workloads
Presto
Distributed SQL
Hadoop
Data Analysis
Machine Learning
Big Data
Data Science

Experience

Total Experience: 14 yrs 11 mos

Average Tenure: 2 yrs 11 mos

Current Experience: 6 yrs

Uber

Engineering Manager

Oct 2015 – Apr 2020 · 4 yrs 6 mos · San Francisco Bay Area

Real Time Infrastructure and Analytics
Manage the Real-time Stream Processing/Analytics/Data Infra team at Uber. My team owns the platform that runs critical pipelines powering Uber's real-time business, as well as analytical ones, with high reliability and at a scale of trillions of messages processed per day. Uber Surge, Eats Dashboarding and Visualization, Machine Learning, Rider Experience, Driver Notifications, etc. are all powered by our platform. We nurture Apache Flink and have contributed back, especially to the SQL APIs.
Lead on the Real-time Streaming Platform team, responsible for integrating Kafka into the Uber ecosystem, scaling it up to trillions of messages a day, and running it reliably to power Uber's real-time business.
Built AthenaX to power SQL on streams, initially on Apache Samza and later on Apache Flink.

Apache Spark · Data Analytics · AI/ML Processing · GenAI Workloads · Data Processing · AI/ML Infrastructure

Salesforce

Senior Member of Technical Staff

Dec 2012 – Oct 2015 · 2 yrs 10 mos · San Francisco Bay Area

I work on the Data Platform and Infrastructure R&D team, responsible for building large-scale systems to ensure data quality and to support data analysis using Hadoop, HBase, and other technologies in the Hadoop ecosystem. Working on migrating critical data processing flows to newer technology frameworks such as Storm, Kafka, and SolrCloud. Work closely with data scientists on developing, analyzing, and productizing machine learning models. Developed API frameworks for exposing cloud-based data cleansing solutions.

Presto · Distributed SQL · Analytics · Data Processing

Oracle

Member Technical Staff

Aug 2011 – Dec 2012 · 1 yr 4 mos · California

I worked on the Oracle Private Cloud Management team on the integrated Oracle cloud stack solutions, including Infrastructure, Database, and Middleware as a service.

Hadoop · Data Quality · Data Analysis · Machine Learning · Data Infrastructure