Decheng Dai

CEO

Sunnyvale, California, United States

16 yrs 8 mos experience

Highly Stable

Key Highlights

Over 18 years of experience in cloud infrastructure.
Led teams managing thousands of Kubernetes clusters.
Designed scalable solutions for major cloud providers.

Stackforce AI infers this person is a Cloud Infrastructure Expert with extensive experience in SaaS and distributed systems.

Skills

Core Skills

Cloud Computing, Distributed Systems, Cluster Management

Other Skills

Kubernetes, Storage, Networking, Virtualization, Security, Infrastructure Management, Scheduling, Data Center Management, Project Management, Engineering, Software Project Management, Shell Scripting, Project Planning, Reliability, C++

About

With over 18 years of experience in computer science, I am a passionate and visionary leader in cloud infrastructure and distributed systems. My mission is to foresee technical directions and scale the infrastructure for cloud products that offer reliable and scalable solutions for data streaming and processing. I bring diverse perspectives and experiences to the team, as I have worked across teams in different domains, platforms, and regions, and have a strong academic background with a PhD from Tsinghua University. As a senior principal engineer at Confluent, I am responsible for the cloud infrastructure that serves Kafka, Kafka Connect, and Flink jobs in nearly a hundred regions in AWS, Azure, and GCP. I focus on the cloud infrastructure for Kubernetes, storage, networking, virtualization, and security, and ensure that our infrastructure offers continuous deployment, autoscaling, disaster recovery, and strong security. I leverage my skills in cloud computing and distributed systems to design and implement innovative and robust solutions for all of Confluent's cloud customers. I also managed a team of 60 engineers in the core Control Plane at Confluent.

Experience

16 yrs 8 mos

Total Experience

7 yrs 5 mos

Average Tenure

1 yr 9 mos

Current Experience

Anthropic

Member of Technical Staff

Aug 2024 – Present · 1 yr 9 mos · San Francisco, California, United States · Hybrid

Confluent

2 roles

Senior Principal Engineer, Lead of Cloud Infrastructure and Platform

Jan 2023 – Aug 2024 · 1 yr 7 mos · Mountain View, California, United States

I am responsible for the technical direction of Confluent's Cloud infrastructure and platform across all products. I focus on the control plane and infrastructure for Compute, Storage, Networking, and Virtualization. My team is also responsible for internal services management and runtime. We manage thousands of Kubernetes clusters in all major regions of AWS, Azure, and GCP. We operate the largest number of Kafka clusters and Kafka Connect deployments in the industry. We optimize for service availability, disaster recovery, and data durability, and for the automation of all three.

Cloud Computing, Distributed Systems, Kubernetes, Storage, Networking, Virtualization

Senior Director, Head of Control Plane, Engineering

Mar 2021 – Jan 2023 · 1 yr 10 mos · Mountain View, California, United States

I was head of the Control Plane of Confluent Cloud (both management and tech-leading). We built the architecture that manages computing resources, secrets, certificates, and their life cycles across all of Confluent's Cloud products as well as most of Confluent's internal infrastructure. We built scalable and highly available infrastructure that spans tens of thousands of nodes on AWS, Azure, and GCP.
At the time I stepped out of management and became an IC, I managed an org with close to 60 FTE engineers at all levels.

Cloud Computing, Distributed Systems, Infrastructure Management

Google

2 roles

Staff Engineer / Senior Staff Engineer

Promoted

Aug 2016 – Mar 2021 · 4 yrs 7 mos

I managed the Borg control plane team (aka Borgmaster) in Google Infrastructure. We were a team of 50 engineers, responsible for cluster management for all of Google's services (Google Brain, Search, Gmail, YouTube, ... everything you know about).
We ran one of the largest (and most successful) cluster management systems in the industry. We managed hundreds of datacenters, millions of physical machines, and billions of containers.
I led and designed the initial vision for "Borg for ML". We scheduled all of Google's TensorFlow and other ML pipelines. We specifically optimized for the efficiency of Google TPUs and Nvidia GPUs (T100, P100, V100) with affinity scheduling, CPU/RAM optimization, and machine sizing techniques. Our work was one of the most critical pieces of infrastructure for all machine learning efforts at Google.

Scheduling, Cloud Computing, Cluster Management

Staff Software Engineer

Sep 2009 – Aug 2016 · 6 yrs 11 mos

I held Software Engineering positions on multiple teams at Google Inc. From 2013 to 2016, I led the conversion tracking products in Google AdWords Express. From 2009 to 2013, I held a Software Engineer position in Google Research, working on an AI-powered Q&A product.