Tanmay Sahay

DevOps Engineer

Mountain View, California, United States · 8 yrs 11 mos experience
Most Likely To Switch: Highly Stable

Key Highlights

  • Expert in building reliable distributed systems at scale.
  • Pioneered automated incident troubleshooting tools at Google.
  • Led significant improvements in monitoring and alerting systems.
Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer with expertise in SRE and distributed systems.

Skills

Core Skills

System Monitoring, Automation, Artificial Intelligence (AI), Natural Language Processing (NLP), Networking, Traffic Engineering, Containerization, Stakeholder Management, Automated Alerts, Data Analysis, Release Management, Problem Solving

Other Skills

Anomaly Detection, Project Management, Alerting, Incident Response, Capacity Management, Infrastructure as Code (IaC), Communication, Autonomous Systems (Internet), Border Gateway Protocol (BGP), IS-IS, TypeScript, Grafana, Inference, C++, Algorithms

About

Read more at https://tanmaysahay.com. I have an interest in architecting and building large-scale distributed systems that run at their peak, with reliability automated away as much as possible. In a domain where failure is inevitable, continuously learning from it keeps me going.

Experience

Total Experience: 8 yrs 11 mos
Average Tenure: 4 yrs 5 mos
Current Experience: 7 yrs 3 mos

Google

7 roles

Software Engineer, Distributed Systems

Feb 2026 – Present · 4 mos

  • Building your autonomous builder swarm.

Software Engineer Site Reliability Engineering - Distributed Systems

Apr 2025 – Feb 2026 · 10 mos

  • ML Inference @ Google (Cloud 🫱🏼‍🫲🏾 Gemini 🟰 Vertex AI)
  • Making LLM serving/inference for Gemini delightfully reliable & reliably delightful at planet scale.
  • Oh, and economical too.
System Monitoring, Alerting, Artificial Intelligence (AI), Natural Language Processing (NLP), Incident Response, Anomaly Detection, +2 more

Software Engineer Site Reliability Engineering - Distributed Systems

Feb 2024 – Mar 2025 · 1 yr 1 mo

  • WAN @ Google
  • Keep Google visible to the world.
  • How? Manage, at scale, peering technologies & intermediate systems via state-of-the-art observability & mitigation systems.
  • Why? To enable you to watch cat videos while simultaneously allowing you to build a variety of workloads in the cloud (maybe ML inference, or anything else for that matter). This is the forcing function ensuring Google handles multitudes of traffic and workload types.
Containerization, Communication, Problem Solving, Networking, Autonomous Systems (Internet), Border Gateway Protocol (BGP), +3 more

Software Engineer, Site Reliability Engineering - Distributed Systems

May 2023 – Feb 2024 · 9 mos

  • Monitoring and alerting for GCP and Google-internal services.
  • Successfully migrated all internal users from 2 disparate tools (a 15-year-old v1 and a 5-year-old v2) to the next-gen alert visualisation tool (v3) with zero downtime. As a result, I trailblazed the adoption of the Google-wide experimentation framework in the Monitoring & Alerting space, showcasing its safety. All in the space of a quarter (for context: 4 years prior, a similar effort lasting 6 months left Google's internal users on both tools instead of fully deprecating the 15-year-old tool).
  • Learned TypeScript and wrote almost 10k lines of it, for the first time in production.
Containerization, Communication, Traffic Engineering, TypeScript, Problem Solving

Software Engineer, Site Reliability Engineering - Distributed Systems

Dec 2022 – May 2023 · 5 mos

  • Docs / Editors @ Google Workspace
  • Derisked observability concerns in a week via 30 code changes requiring approval from over 15 engineering leads and managers in partner dev teams. For context, this problem had gone unsolved for over 2 quarters before I joined the team. I fixed production notifications so that SRE isn't blind to customer pain.
  • Biggest challenges overcome: slow bureaucracy (by being clear and upfront with stakeholders about the changes being made) and learning just enough of an obsolete language to solve the issue end to end.
Containerization, Communication, Automated Alerts, Stakeholder Management, Problem Solving

Software Engineer Site Reliability Engineering - Distributed Systems

Apr 2021 – Present · 5 yrs 2 mos

  • Automated incident troubleshooting @ Google
  • Built and facilitated Google-wide adoption of an automated incident troubleshooting tool, impacting multiple product areas by reducing mean time to mitigation from hours (or even days) to minutes.
System Monitoring, Anomaly Detection, Automation, Project Management

Software Engineer, Site Reliability Engineering - Distributed Systems

Mar 2019 – Dec 2022 · 3 yrs 9 mos

  • Serverless @ Google Cloud Platform
  • Making Google Cloud's Serverless compute offerings (App Engine, Cloud Functions, Cloud Run) more reliable. Check out cloud.google.com/serverless (featured at Google I/O '22 - https://youtu.be/qBkyU1TJKDg?t=2556)
  • Projects
    • Automated stack turnups
    • Enabled safe, slow rollouts of changes to over 500 databases. This ended the status quo of bad schema changes affecting customers globally for multiple hours; no such issues occurred thereafter.
    • Set up operational excellence processes that helped the team and dev partner teams to sustainably reduce ops load. The SRE team was able to onboard 40% more partner teams whilst continually reducing overall operational load and getting rid of unactionable alerts.
    • Built an automated incident root-causing system (used Google-wide). This reduced mean time to response and mitigation from multiple hours to just minutes.
  • Operations
    • Experienced oncaller who has handled numerous incidents and followed up with postmortems. I've led large-scale incident responses spanning numerous days (most notably the Log4j remote-code-execution vulnerability response).
    • Improved tooling across products to speed up debugging and impact analysis by automating toilsome tasks into scripts/notebooks, cutting work that took hours down to seconds.
  • Community
    • Mentored 10+ FTEs/interns into their roles
    • Translated numerous math lessons for over 1M primary school children in developing countries for Oppia (check out oppia.org)
Containerization, Communication, Problem Solving, Data Analysis, Release Management

Booking.com

2 roles

Software Developer

Promoted

Jun 2018 – Feb 2019 · 8 mos

  • Worked on automating various parts of the machine learning platform.
Containerization, Communication, Problem Solving, Grafana

Graduate Software Developer

Jun 2017 – Jun 2018 · 1 yr

  • Involved in the infrastructure teams for reviews, images and machine learning.
Communication, Problem Solving

Adobe

Site Reliability Engineer Intern

Jun 2016 – Aug 2016 · 2 mos · Noida Area, India

  • Worked on Adobe Social query monitoring and auto-remediation of servers using SaltStack, along with various other scripts to automate processes on the Adobe Social and Analytics team.
Communication, Problem Solving

IIIT Hyderabad

Teaching Assistant

Jan 2016 – May 2016 · 4 mos

  • My responsibilities included conducting labs and tutorials, along with setting and testing problems for Data Structures assignments.
Communication, Problem Solving

HackerEarth

Problem Curator

Jul 2015 – Apr 2016 · 9 mos

  • My job entailed testing problems and writing editorials for problems to be used in contests.
Communication, Problem Solving

ZopNow.com

SDE Intern

Jun 2015 – Jul 2015 · 1 mo · Bengaluru Area, India

  • My responsibilities included automating the process of planning trips so as to deliver groceries on time while covering the least possible distance, to maximize profits.
Communication, Problem Solving

Imaginate

Student Software Developer

Aug 2014 – Nov 2014 · 3 mos · Hyderabad Area, India

  • Undertook a project to build a Robot Control Management Server, using Django and ROS, to control robots in a warehouse environment.
Communication, Problem Solving

Education

International Institute of Information Technology Hyderabad (IIITH)

Bachelor of Technology (B.Tech.) — Computer Science

Jul 2013 – May 2017

National Public School

Primary and Secondary School

Jun 2005 – Mar 2013

Bishop Cotton Boys' School

Primary School

Jun 2001 – Mar 2005
