Nitin Kesarwani

DevOps Engineer

San Francisco, California, United States · 16 yrs 10 mos experience
AI Enabled · AI ML Practitioner

Key Highlights

  • 15+ years in data and ML infrastructure design.
  • Expert in building scalable cloud solutions.
  • Proven leadership in autonomous vehicle projects.
Stackforce AI infers this person is a Cloud Computing and Autonomous Vehicles expert with extensive experience in scalable infrastructure.

Skills

Core Skills

Machine Learning · Amazon Web Services (AWS) · GraphQL · Java · Self Driving · Social Networking Software · A/B Testing · Anomaly Detection · Workflow Software · Oracle Fusion Middleware · Oracle Database · Operating Systems · Realtime Linux

Other Skills

AJAX · ActiveMQ · Agentic AI · Airflow · Algorithms · Android Development · Apache Spark · C · C++ · Cassandra · CockroachDB · Core Java · Data Structures · Databases · Deep Learning

About

Staff Engineer with 15+ years of experience designing and scaling petabyte-scale data and ML infrastructure. My career has focused on applying foundational principles of distributed systems to solve high-stakes business problems in cloud computing, social media products, L4 autonomy, and personalized e-commerce search.

Experience

Total Experience: 16 yrs 10 mos
Average Tenure: 2 yrs 5 mos
Current Experience: 2 yrs 5 mos

Coupang

Staff Machine Learning Infra Engineer II

Jan 2024 - Present · 2 yrs 5 mos · Mountain View, California, United States · On-site

  • (Product Search & Discovery infrastructure)
  • I am part of the Core Infrastructure team that designs, builds, and operates the foundational systems powering product discovery and personalization. Our responsibilities cover the entire pipeline, from ingesting product data to serving it through keyword-based and ANN-based retrieval. We extract features from raw product information and manage them using scalable feature stores. We own the complete query serving stack, along with all CPU and GPU inference infrastructure supporting both light and heavy ranking models. In addition to search, we develop and maintain the infrastructure for query rewriting, autocomplete, feeds, and recommendations. This includes customer data pipelines, embedding generation, and model serving to enable real-time personalized experiences across the platform.
Lucene · Python · Java · Machine Learning · Semantic Search · Apache Spark · +7

CloudKitchens

Software Engineer L6

May 2021 - Feb 2024 · 2 yrs 9 mos · Los Angeles, California, United States · On-site

  • (Product platform, Money Platform and Online Food Ordering App Integration Platform)
  • Worked on the following multi-year strategic initiatives:
  • Led the redesign of the CloudKitchens Facility platform. Merged internal real estate management software with facility operations needs for Otter (US), FMS (China) and CloudRetail to build a single source of truth system. Also rearchitected the real estate planning backend subsystem and the associated search index subsystem.
  • Improved DevEx of product engineers via GraphQL Federation and codegen. Set the vision for decomposing multiple GraphQL monoliths in the company in favor of federated micro-GraphQL gateways per product team/line, improving build and release of the respective subgraphs. We built a Netflix DGS-style ecosystem of offerings, used the WunderGraph library for federation and provided developer annotations that generate service code with less boilerplate. We also built a custom GraphQL schema-to-code generation library, improving product development speed across the company.
  • Money Platform P&L reporting, yearly subscriptions and late payment fees.
  • Store-health connection availability check subsystem.
  • Led design and development of a sharded, replicated and multi-regional stateful service backed by an active-active DB that minimizes the blast radius of errors, ensuring high availability across regions for Order Import and Customer onboarding operations. The work mimics the idea described here: https://research.google/pubs/fast-key-value-stores-an-idea-whose-time-has-come-and-gone/
Java · Kubernetes · Microsoft Azure · GraphQL · Go (Programming Language) · CockroachDB · +2

Argo AI

Staff Software Engineer

Aug 2018 - May 2021 · 2 yrs 9 mos · Palo Alto, California, United States · On-site

  • (ReSim Infra Architect - Verification & Validation stack for L4 Autonomy)
  • Cross-functional Technical Lead for multiple Platform & Services teams focused on: ingesting data from vehicles into the cloud; auto-generating interesting events from a vehicle trip; ingesting these events into a search system that can answer complex geospatial queries with varying actors; labeling and indexing the presence of various actors in the scene; and synthesizing simulation scenarios from vehicle safety guidelines and real-time events. These services formed the backbone of our company and empowered Autonomy teams to efficiently manage self-driving vehicle logs and execute Autonomy workflows in the cloud. Worked directly with and scaled the teams focused on the following initiatives:
  • Ingestion of data from vehicle to cloud.
  • Processing via Spark clusters.
  • Log slicing service.
  • Triaging and video generation subsystem
  • Log event indexing in context of various actors in the scene
  • Datasets for various Autonomy usecases
  • Simulation subsystem
  • Replay simulation subsystem
  • 2D and 3D Labeling pipelines
  • Fault injection, Property Based testing and CI gating
  • The above surface area maps to 6 teams in Platform & Services handling requests from 30+ Autonomy teams. Each month we ingest PBs of data from our fleet, amounting to millions of real scenarios. Autonomy teams find and bundle similar scenarios into Datasets for their feature development and CI gating. A complex use case would involve an Autonomy user asking for all scenarios where a pedestrian, a bicyclist and a horse were within 5 meters of the car's trajectory near a particular intersection, in order to build a Dataset on which they could run a new Motion Planning algorithm without having to release it fully on the car, thereby expediting the development cycle.
  • By the time I transitioned from my role, the systems I helped build successfully managed simulation and replay over hundreds of PBs of vehicle logs and millions of real scenarios.
C++ · Amazon Web Services (AWS) · Apache Spark · Java · Kubernetes · Elasticsearch · +6

Quora

Software Engineer

Sep 2017 - Aug 2018 · 11 mos · Mountain View, California, United States

  • (ML Platform, News Feed and Experimentation group)
  • Helped with rewriting pieces of news feed serving for scalability initiatives.
  • Improved disk throughput performance of the ML cluster, enabling faster execution of longer jobs with larger datasets.
  • Contributed to an ongoing effort to migrate the feed serving to a more scalable data store (HBase).
  • Helped with rewriting A/B testing framework, enhancing its functionality and efficiency.
  • Improved availability and reliability of data ingestion pipelines for all experimentation, achieving 99.9% reliability under high load compared to the previous 75%.
  • Worked on a prototype of a clustered Presto setup and a unified data lake architecture, paving the way for more scalable and advanced analytics capabilities.
A/B Testing · HBase · Social Networking Software · Java · Airflow · News Feed

Amazon Web Services (AWS)

Senior Software Development Engineer

Dec 2013 - Sep 2017 · 3 yrs 9 mos · Greater Seattle Area

  • (CloudWatch Metrics Anomaly Detection)
  • Lead engineer on Amazon-wide Metric Anomaly Detection Service (4 SDEs, 2 Applied Scientists):
  • Redesign and migration of predictive analysis for time series data across four teams, collaborating with 100+ stakeholders.
  • Led the migration of an existing service to 10 AWS regions.
  • Public launch: https://aws.amazon.com/about-aws/whats-new/2019/07/introducing-amazon-cloudwatch-anomaly-detection-now-in-preview/
  • (CloudWatch Logs and Logs Insights)
  • One of the 3 lead engineers (30 SDEs) on Amazon-wide Logging and Log Analytics team:
  • Launch of Subscriptions and cross-account subscriptions, enabling real-time intrusion detection at scale. https://aws.amazon.com/blogs/aws/cloudwatch-logs-subscription-consumer-elasticsearch-kibana-dashboards/
  • Contributed as a codeveloper to Export to S3, GetLogEvents and FilterLogEvents features.
  • Innovated a full-Continuous-Deployment scheme, marking the first multi-region CD pipeline with zero human intervention in AWS.
  • Vended logging for AWS services. https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-cloudwatch-adds-route53-logs-to-vended-logs/
  • Led the redesign of the clustered architecture for CloudWatch Logs. Resulted in a PoA talk.
  • Founding team member for AWS CloudWatch Insights: https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-cloudwatch-logs-insights-fast-interactive-log-analytics/
  • Cost optimization while accommodating substantial growth by optimizing storage and archival infrastructure.
  • Built Logs service into three new AWS regions.
  • Self-healing systems to reduce escalations by over 1000 per year. Cost-per-release framework.
  • Led the CloudWatch Samurai group to foster cross-organization collaboration.
  • Mentored 5 interns and 3 junior engineers, resulting in full-time offers and promotions.
  • AWS Bar Raiser. Mentored other interviewers through the program.
  • Granted 9 patents to date in Cloud Computing and Predictive Machine Learning technologies.
Core Java · A/B Testing · Amazon Web Services (AWS) · Oracle Database · Scalability · Logging · +2

Amazon.com

Software Development Engineer

May 2011 - Dec 2013 · 2 yrs 7 mos · New York City Metropolitan Area

  • (Flash sale business now part of Amazon Fashion)
  • During my tenure as a member of the MYHABIT backend development team from 2011 to 2014, I made contributions aimed at optimizing workflows, enhancing security, and facilitating data-driven decision-making, ultimately strengthening MYHABIT's position in the market. Initially based in Bangalore, I transitioned to New York City as a part of this team.
  • MyHabit daily business report summarizing focus of the day
  • MyHabit Dropship and Consignment subsystems
  • MyHabit Vendor Central Analytics subsystem
  • Retail Service Platform based workflows handling Copy Editing
  • Copy Editing and Image Upload based daily reporting
  • Granular ASIN based lifecycle tracking subsystem and auditing
  • Distributed Quartz based scheduler to manage timed jobs for MyHabit business
  • BuyVIP Copy Editing, Image Upload and Analytics
  • MyHabit Purchase Order lifecycle tracking
  • Rule Based Access Control subsystem for MyHabit internal users
Core Java · Workflow Software · Amazon Web Services (AWS)

Oracle

Member of Technical Staff

Aug 2009 - May 2011 · 1 yr 9 mos · Bengaluru, Karnataka, India · On-site

  • (Oracle Beehive and Oracle Fusion Middleware)
  • Within the Oracle Beehive suite, I made contributions to the development of the core framework that touched email systems, chat functionality, syndication tools, Microsoft Exchange connector integrations and the critical DMZ infrastructure.
  • Within the Oracle DataLens suite, I made contributions towards integrating and launching newly acquired DataLens technology within the Fusion Middleware stack.
Oracle Fusion Middleware · Storage · Oracle DataLens · Oracle Beehive · Oracle Database · Networking · +1

IBM

Intern at Linux Technology Center

May 2008 - Aug 2008 · 3 mos · Bengaluru, Karnataka, India · On-site

  • (Real time Linux Kernel development team)
  • (Secured a return job offer)
  • Worked under the mentorship of the RT-Linux and SystemTap teams on improving real-time diagnostics for the PREEMPT_RT kernel. Contributed to the development of a suite of SystemTap tapsets used to analyze and troubleshoot latency, scheduling, and interrupt behavior in real-time Linux systems.
  • Developed tapsets to monitor futex contention, CPU migration, preemptions, runqueue delays, rtmutex priority inheritance, missed wakeups, and more
  • Worked closely with kernel static markers and debugging tools to trace subtle real-time performance issues
  • Continued this work as part of my undergraduate thesis, exploring its applicability in virtualized real-time environments
  • Secured return offer based on performance.
Operating Systems · Realtime Linux · SystemTap · Linux Kernel

Education

National Institute of Technology Karnataka

Bachelor of Technology (B.Tech.) — Computer Science

Aug 2005 - May 2009
