Purva Gaikwad

Software Engineer

Sunnyvale, California, United States · 2 yrs experience

Key Highlights

  • Expert in building distributed systems at scale.
  • Proven track record in MLOps and cloud infrastructure.
  • Highest rated Teaching Assistant at UCLA.
Stackforce AI infers this person is a Cloud Infrastructure and Machine Learning expert with a focus on distributed systems.

Skills

Core Skills

Distributed Systems · Cloud Infrastructure · Teaching · Machine Learning · MLOps · Software Development

Other Skills

Failure Detection · Recovery Orchestration · Observability · Mathematics · Communication · Leadership · AutoGluon · Amazon SageMaker · Predictive Autoscaling · Java-based Orchestration · Java · FastAPI · Docker · Kubernetes · Kubeflow

About

I’m a Software Engineer building large-scale distributed systems and cloud infrastructure. My focus is on high availability, fault tolerance, low latency, and reliable system behavior at scale. I enjoy owning problems end to end, leading initiatives, and collaborating across teams. I also enjoy mentoring, sharing knowledge, and learning continuously. PS: I was the highest rated Teaching Assistant at UCLA, recognized for explaining complex topics clearly and for strong communication skills.

I currently work on the Aurora Control Plane at AWS, where I design and build core services for distributed databases. My work includes accurate failure detection, recovery orchestration, and observability across regions and availability zones. I help systems handle network partitions safely, make quorum-based decisions, and recover quickly from failures. I also participate in on-call rotations, triage Sev2/Sev3 incidents and SLA violations, and build tools and dashboards that reduce alert noise and operational toil.

Before joining AWS full-time, I interned on the same team. I built a predictive autoscaling pipeline using AutoGluon and Amazon SageMaker, which reduced scaling lag from about 30 minutes to under 5 minutes during traffic spikes. I integrated ML forecasts into safe, idempotent control-plane workflows used in production.

Earlier, I worked in ML Operations and ML Infrastructure at Syngenta. I built and operated production ML inference services using FastAPI, Docker, and Kubernetes. I designed reusable ML pipelines with Kubeflow, deployed models with KServe, and orchestrated microservices using Cadence. These changes reduced end-to-end inference latency by ~95% and improved system reliability. I also contributed to Syngenta’s engineering blog on real-world ML challenges.

I hold an M.S. in Computer Science from UCLA, where I focused my studies on distributed systems, cloud systems, and large-scale backend design.

I’m interested in roles involving distributed systems, cloud infrastructure, ML platforms, inference pipelines, and backend platform engineering. 📩 Reach me at purvag11.11@gmail.com

Experience

2 yrs
Total Experience
1 yr
Average Tenure
1 yr
Current Experience

Amazon Web Services (AWS)

2 roles

Software Development Engineer – Distributed Systems & Control Plane (AWS Aurora)

May 2025 - Present · 1 yr · East Palo Alto, CA · On-site

  • Enhance and contribute to core control-plane services for AWS Aurora Limitless, responsible for failure detection, recovery orchestration, and observability across distributed databases operating at cloud scale.
  • Implement network-partition–aware detection and quorum-based decisioning to prevent unsafe failover actions and improve recovery correctness under partial failures.
  • Improve on-call signal quality by reducing operational noise and surfacing actionable failure signals through observability and alerting improvements.
  • Own on-call responsibilities for the Aurora Limitless control plane, ramping to solo on-call within ~3 months, and actively triaging Sev2/Sev3 incidents and SLA violation tickets.
  • Lead deep-dive root cause analyses for recurring SLA violations and host replacement incidents, identifying infrastructure-level failure patterns and documenting findings as new SOPs.
  • Drive ticket hygiene and on-call toil reduction by tuning alert thresholds, suppressing low-value signals, bulk-cleaning stale tickets, and building dashboards and tooling that surface actionable failure signals.
  • Contribute to knowledge sharing and onboarding by running shadow/reverse-shadow sessions, creating on-call readiness checklists, and helping new engineers ramp safely and efficiently.
Distributed Systems · Cloud Infrastructure · Failure Detection · Recovery Orchestration · Observability

Software Development Engineer Intern – Distributed Systems & ML Infrastructure (AWS Aurora)

Jun 2024 - Sep 2024 · 3 mos · East Palo Alto, CA · On-site

  • Built a predictive autoscaling pipeline using AutoGluon and Amazon SageMaker for the AWS Aurora Limitless control plane, combining time-series forecasting with reactive system signals to reduce resource scaling lag from ~30 minutes to ~5 minutes under bursty workloads.
  • Integrated ML forecasts into safe, idempotent Java-based orchestration workflows, enabling automated resource scaling decisions and improving system scalability while reducing downtime during demand spikes.
AutoGluon · Amazon SageMaker · Predictive Autoscaling · Java-based Orchestration · Machine Learning · Cloud Infrastructure

UCLA

Teaching Assistant

Jan 2024 - Mar 2025 · 1 yr 2 mos · Los Angeles County, California, United States · On-site

  • Assisted in teaching the following courses:
    1. MATH 31B: Integration and Infinite Series, Winter 2024
    2. MATH 32B: Calculus of Several Variables, Spring 2024
    3. MATH 31AL & MATH 31B, Fall 2024
    4. ENGR 186W: Ethics for Computer Scientist, Winter 2025
Mathematics · Communication · Leadership · Teaching

Syngenta

Software Developer - MLOps

Aug 2022 - Aug 2023 · 1 yr · Pune

  • Built and operated production ML inference services using FastAPI, Docker, and Kubernetes, supporting high-throughput, real-time workloads as part of ML operations.
  • Designed reusable ML pipelines for model training, testing, and deployment using Kubeflow, and deployed models using KServe for scalable, production-grade model serving.
  • Implemented Cadence-based workflow orchestration to coordinate ML inference services and downstream microservices, enabling fault-tolerant execution and parallelism across the ML operations stack.
  • Improved end-to-end inference latency by ~95% through batch processing and parallel execution techniques.
  • Integrated logging and monitoring for deployed ML services using Datadog, improving debuggability and operational visibility in production.
  • Authored and contributed technical blogs on ML operations and production challenges for Syngenta Digital, including:
  • From Brainwave to Machine Learning Grave: Challenges Faced by ML Models
  • https://medium.com/syngenta-digitalblog/from-brainwave-to-machine-learning-grave-challenges-faced-by-ml-machine-learning-models-from-dff73d8234f3
FastAPI · Docker · Kubernetes · Kubeflow · KServe · Cadence · +3 more

IDeaS Revenue Solutions

Associate Software Developer

Jan 2022 - Jul 2022 · 6 mos · Pune

  • Designed and implemented a Redis-backed caching proof of concept for the Revenue Management System (RMS), improving request latency and overall system performance by ~40%.
  • Built backend features and application pages using Java and Angular, improving stakeholder access to operational and analytical data within the RMS.
  • Developed optimized log search and Excel export tooling for system logs, improving internal debugging, testing workflows, and developer efficiency by ~15%.
  • Gained hands-on experience with enterprise backend systems, contributing to feature development, testing, and quality assurance across the RMS stack.
Redis · Java · Angular · Test Design · Software Development

Vishwakarma Institute of Information Technology

Research Collaborator

Jul 2021 - Dec 2021 · 5 mos · Pune, Maharashtra, India

  • Worked on a research project titled "Analysis and prediction of soil nutrients for crop using machine learning classifier"

Syngenta

Software Development Intern

Jul 2021 - Dec 2021 · 5 mos · Pune

  • Interned with the Digital Product Engineering (DPE) team at Syngenta on two projects: "Feature Store for Machine Learning pipelines" and "Uber Cadence - Microservice Orchestration".
  • Worked with the DPE team on ML infrastructure initiatives, including building a feature store for machine learning pipelines to support consistent and reusable feature computation.
  • Implemented Cadence-based microservice orchestration, coordinating distributed services and enabling reliable, parallel execution of backend workflows.
  • Presented project outcomes in a global technical forum, communicating system design, trade-offs, and results to engineers and interns across teams, which strengthened my communication and presentation skills.

Machine Learning Forum VIIT

ML Forum Core Team

Jan 2020 - Jan 2021 · 1 yr · Pune

  • Conducted seminars for students and contributed to knowledge sharing within the forum. Spearheaded the CUDA session and the creation of a YouTube playlist for learning Python.
Leadership

DLRD - Deep Learning Research & Development

Technical Videography Intern

Jun 2019 - Sep 2019 · 3 mos · Pune

Education

UCLA Henry Samueli School of Engineering and Applied Science

Master of Science - MS — Computer Science

Sep 2023 - Present

Vishwakarma Institute of Information Technology

Bachelor of Technology - BTech — Computer Engineering

Jan 2018 - Jan 2022

Bharat Children's Academy And Jr. College

HSC — science

Jan 2016 - Jan 2018
