Naveen Cherukuri — CTO
Currently in Meta Superintelligence Labs, building the state-of-the-art scheduler/resource manager/orchestrator for AI training and inference workloads at Meta, handling one of the largest GPU fleets in the industry for Ads, Facebook, IG, and Llama. Engineering leader with 13+ years of experience in data infrastructure and AI infrastructure, building and running the open-source Presto, Spark, and Velox engines at scale at Meta, Uber, and elsewhere.

MAST (https://www.usenix.org/system/files/osdi24-choudhury.pdf) is Meta's AI training job scheduler. We build and run services for scheduling distributed ML training and inference jobs on Meta's internal GPU compute infrastructure. We are at the heart of Meta's investment in AI. Every day, MAST schedules and starts hundreds of thousands of AI training jobs for all of our product groups, and we will scale further. We deal with challenges such as gang scheduling, matching jobs with varying requirements to heterogeneous hardware, fairness, and constraint solving around data locality, regional capacity, and job priorities. As an essential piece of Meta infrastructure, we must of course ensure service reliability, high SLA availability, and service efficiency. Most of our code is in C++.

Previously:
1. Supported large-scale analytics and AI/ML processing at Meta scale. The Spark team runs mission-critical workloads across data analytics and AI/ML businesses and is expanding and evolving rapidly to support GenAI workloads on Meta's data lake. This team is responsible for designing, building, and supporting one of the largest data processing systems on the planet.
2. Meta open-sourced Presto for interactive (RaptorX) and low-latency analytics (https://prestodb.io/): a distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
3. Meta open-sourced Velox (https://github.com/facebookincubator/velox).
Velox is a C++ database acceleration library that provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and is currently developed in partnership with Intel, ByteDance, and Ahana.
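The bio above names gang scheduling as a core MAST challenge: a distributed training job must get all of its GPUs at once or none at all, since the workers run in lockstep. The sketch below illustrates that all-or-nothing idea in C++ (the language the profile says most of the code is in). It is a minimal illustration only, not MAST's actual design; the `Host` struct and `gang_schedule` function are hypothetical names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical view of one machine in the fleet: name plus free GPU count.
struct Host {
    std::string name;
    int free_gpus;
};

// Try to reserve `gpus_needed` GPUs across the fleet as a single gang.
// Either the whole request fits and every reservation is committed, or
// nothing is allocated at all (all-or-nothing placement).
std::optional<std::vector<std::pair<std::string, int>>>
gang_schedule(std::vector<Host>& fleet, int gpus_needed) {
    std::vector<std::pair<std::string, int>> plan;  // (host, gpus taken)
    int remaining = gpus_needed;

    // First pass: build a tentative plan without mutating the fleet.
    for (const Host& h : fleet) {
        if (remaining == 0) break;
        int take = std::min(h.free_gpus, remaining);
        if (take > 0) {
            plan.emplace_back(h.name, take);
            remaining -= take;
        }
    }
    // Gang does not fit: back out with no side effects.
    if (remaining > 0) return std::nullopt;

    // Second pass: commit the reservation only once the whole gang fits.
    for (const auto& [name, take] : plan)
        for (Host& h : fleet)
            if (h.name == name) h.free_gpus -= take;
    return plan;
}
```

A real scheduler would additionally weigh hardware heterogeneity, data locality, and priority (the constraints the bio lists); the two-pass plan-then-commit shape is what makes the placement atomic from the job's point of view.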
Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer specializing in AI and Data Processing.
Location: Sunnyvale, California, United States
Experience: 14 yrs 11 mos
Skills
- AI Infrastructure
- Distributed Systems
- Resource Management
- Data Processing
- AI/ML Infrastructure
- Analytics
- Data Infrastructure
- Data Quality
Career Highlights
- Led AI training infrastructure for Meta's GPU fleet
- Developed open-source data processing engines at scale
- Managed real-time analytics for Uber's critical operations
Work Experience
Meta
Engineering leader (6 yrs)
Uber
Engineering Manager (4 yrs 6 mos)
Salesforce
Senior Member of Technical Staff (2 yrs 10 mos)
Oracle
Member Technical Staff (1 yr 4 mos)
Microsoft
Software Developer Intern (3 mos)
Education
Master's degree at University of Illinois Urbana-Champaign
Bachelor's degree at Indian Institute of Technology, Guwahati