Omar Baldonado — CTO
I manage the groups that develop and operate Meta's data center networks, which support Meta's family of apps (Meta AI, Facebook, Instagram, WhatsApp, Messenger) and our AI models. We have developed some of the largest AI clusters in the world (129K GPUs in 2024), with more gigawatt-scale clusters coming.

We are hiring! We are looking for ICs and managers across multiple groups (hardware/software/network engineers and TPMs). Our groups include:
* the overall network topologies & control stack;
* the networking switches/NOS (FBOSS);
* host-based networking (NICs, eBPF, transport/congestion control, RoCE, ...);
* AI-specific teams working on communication libraries and performance optimization.

Our work spans the entire network lifecycle:
* hardware/software/network engineering;
* topology design & capacity planning;
* distributed protocols & centralized control;
* provisioning & delivery workflows;
* monitoring, debugging, & analytics;
* performance benchmarking & tuning.

Some highlights of our work:
* Developing TorchComms as a new high-performance AI networking layer: https://pytorch.org/blog/torchcomms/
* Non-Scheduled Fabric (NSF) for AI scale-out and Ethernet for scale-up networking: https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/
* Building large-scale (100K+ GPU) AI clusters based on RoCE: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/
* Disaggregated Scheduled Fabric (DSF) for AI clusters, and FBNIC, our first network ASIC: https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/
* NetEdit, an orchestration platform for eBPF network functions at scale: https://dl.acm.org/doi/10.1145/3651890.3672227
* DCTCP at scale: https://www.usenix.org/conference/nsdi24/presentation/dhamija
* FBOSS, our Network Operating System (NOS) for our DC network switches: https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/ and https://engineering.fb.com/2021/11/09/data-center-engineering/ocp-summit-2021/
* BGP and Open/R routing protocols in our DCs and WAN: https://research.fb.com/publications/running-bgp-in-data-centers-at-scale/

Learn more about AI networking through the Networking @Scale conferences that we host:
* https://engineering.fb.com/2025/09/26/networking-traffic/networking-at-the-heart-of-ai-scale-networking-2025-recap/
* https://atscaleconference.com/events/networking-scale-2024/
* https://atscaleconference.com/events/networking-scale-2023/
Location: Palo Alto, California, United States
Experience: 31 yrs 11 mos
Skills
- Data Center
- Networking
- Product Management
Career Highlights
- Expert in managing large-scale AI networking projects
- Pioneered open-source networking hardware initiatives
- Led development of high-performance AI clusters
Work Experience
Meta
Senior Director, Data Center & AI Networking (7 yrs 1 mo)
Director, DC Networking / Net Systems (5 yrs 7 mos)
Open Compute Project
OCP Networking Project Co-Lead (8 yrs)
Big Switch Networks
Head of Product Management (2 yrs 7 mos)
Social startup
Founder (1 yr)
Avaya
Director of System Management/Assured Networks (4 yrs)
Routescience Technologies
R&D Director (4 yrs)
Cisco Systems
Engineering Manager (4 yrs)
NETSYS Technologies
Senior Software Engineer (1 yr 8 mos)
Make Systems
Software Engineer (2 yrs 8 mos)
Education
BS at Stanford University