Vaibhav Sharma

Software Engineer

Seattle, Washington, United States · 17 yrs 9 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Pioneered the first public cloud platform at AWS.
  • Architected core scenarios for AWS Elastic Block Store.
  • Designed a global Azure IoT platform for high availability.
Stackforce AI infers this person is a Cloud Computing Architect with extensive experience in distributed systems and AI workloads.

Skills

Core Skills

Cloud Computing · Distributed Systems · Database Applications · Project Management

Other Skills

AI Workloads · Agile Methodologies · Algorithms · Hadoop · Java · Object Oriented Design · Perl · Scalability

About

Built the first ever public cloud platform at AWS. Over a decade of experience in taking cloud services from conception to launch. Architected core scenarios in the next generation of AWS Elastic Block Store (EBS) with Storage Area Network (SAN) design. Designed a highly available, global Azure IoT platform.

Experience

Amazon Web Services (AWS)

2 roles

Principal Software Engineer

Promoted

Aug 2022 - Present · 3 yrs 7 mos · Greater Seattle Area

Cloud Computing · Distributed Systems · Project Management

Senior Software Engineer

Jul 2008 - Jun 2017 · 8 yrs 11 mos · Greater Seattle Area

  • Designed and led the implementation of core components in the next generation of AWS Elastic Block Store (EBS) with a Storage Area Network (SAN) architecture. I led the team that implemented hardware bootstrap, drive allocation, failure detection, and data reconstruction (with RAID). The project reduced hardware costs by 40% while improving both data durability and I/O performance.
  • Led the team to build a transactional, temporal (append-only), multi-tenant, high-scale, high-availability database on top of DynamoDB. The solution is designed to work on any other key-value store as well. It scales for the read-heavy transactional workload of AWS platform data while meeting its strict security and auditability requirements.
  • Led the design, implementation, and data migration for the next-generation AWS platform. I built the AWS commerce platform workflow engine, which processes the massive number of metering records used to compute bills and usage reports. I led the team to build a fault-tolerant, secure bill storage solution, and then led the migration of highly sensitive billing and customer profile data, with the associated workflows, to the new platform.
Cloud Computing · Distributed Systems · Project Management · Database Applications

Microsoft

Principal Group Manager

Jun 2017 - Aug 2022 · 5 yrs 2 mos · Redmond, Washington

  • Led the data plane implementation for the Singularity project under the Azure CTO Office (Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads, https://arxiv.org/pdf/2202.07848.pdf). Designed and implemented many core scheduling, preemption, and restore features. Singularity is the new backend AI infrastructure that my team at Microsoft built; it hosts the workloads for GitHub Copilot, Office Copilot, Azure AI, and other AI products offered by Microsoft.
  • Designed and led the implementation of a distributed in-memory database (named Magnet db) that runs on top of Azure blob store with a pluggable replication protocol. It enabled the Azure IoT platform to scale for a massive transactional workload, handling millions of IOPS.
  • Architected a one-click customer-initiated cross-region failover for Azure IoT. The feature enables self-service disaster handling for Azure customers with strict RTO (Recovery Time Objective) and RPO (Recovery Point Objective) guarantees.
  • Led the team to automate performance prediction, isolation, and recovery for the Azure IoT message routing system. Previously, one unhealthy (slow or stuck) routing endpoint could disrupt message flow to other healthy endpoints. The new solution eliminated the cross-endpoint impact while maintaining liveness and safety throughout (no data loss or downtime).
  • Conducted workshops with Dr. Leslie Lamport to ramp up my team on formal methods (TLA+). As scale increased, the Azure IoT team was starting to see a new set of race conditions and failure modes. TLA+ helped the team eliminate these issues.
Cloud Computing · Distributed Systems · Project Management · AI Workloads · Database Applications

Education

Indian Institute of Technology, Roorkee

Master of Technology - MTech — Computer Science

Jan 2007 - Jan 2008

Indian Institute of Technology, Roorkee

Bachelor of Technology - BTech — Computer Science and Technology

Jan 2003 - Jan 2007
