Tanmay Sahay

DevOps Engineer

Mountain View, California, United States · 8 yrs 11 mos experience

Most Likely To Switch · Highly Stable

Key Highlights

Expert in building reliable distributed systems at scale.
Pioneered automated incident troubleshooting tools at Google.
Led significant improvements in monitoring and alerting systems.

Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer with expertise in SRE and distributed systems.

Skills

Core Skills

System Monitoring, Automation, Artificial Intelligence (AI), Natural Language Processing (NLP), Networking, Traffic Engineering, Containerization, Stakeholder Management, Automated Alerts, Data Analysis, Release Management, Problem Solving

Other Skills

Anomaly Detection, Project Management, Alerting, Incident Response, Capacity Management, Infrastructure as Code (IaC), Communication, Autonomous Systems (Internet), Border Gateway Protocol (BGP), IS-IS, TypeScript, Grafana, Inference, C++, Algorithms

About

Read more at https://tanmaysahay.com. I am interested in architecting and building large-scale distributed systems that run at their peak, with reliability automated away as much as possible. In a domain where failure is inevitable, continuously learning from it keeps me going.

Experience

8 yrs 11 mos

Total Experience

4 yrs 5 mos

Average Tenure

7 yrs 3 mos

Current Experience

Google

7 roles

Software Engineer, Distributed Systems

Feb 2026 – Present · 4 mos

Building your autonomous builder swarm.

Software Engineer Site Reliability Engineering - Distributed Systems

Apr 2025 – Feb 2026 · 10 mos

ML Inference @ Google (Cloud 🫱🏼‍🫲🏾 Gemini 🟰 Vertex AI)
Making LLM serving/inference for Gemini delightfully reliable & reliably delightful at planet scale.
Oh, and economical too.

System Monitoring, Alerting, Artificial Intelligence (AI), Natural Language Processing (NLP), Incident Response, Anomaly Detection, +2

Software Engineer Site Reliability Engineering - Distributed Systems

Feb 2024 – Mar 2025 · 1 yr 1 mo

WAN @ Google
Keep Google visible to the world.
How? Manage, at scale, peering technologies & intermediate systems via state-of-the-art observability & mitigation systems.
Why? To enable you to watch cat videos while simultaneously allowing you to build a variety of workloads in the cloud (maybe ML inference, or anything else for that matter). This is the forcing function ensuring Google handles multitudes of traffic and workload types.

Containerization, Communication, Problem Solving, Networking, Autonomous Systems (Internet), Border Gateway Protocol (BGP), +3

Software Engineer, Site Reliability Engineering - Distributed Systems

May 2023 – Feb 2024 · 9 mos

Monitoring and alerting for GCP and Google-internal services.
Successfully migrated all internal users from 2 disparate tools (a 15-year-old v1 and a 5-year-old v2) to the next-gen alert visualisation tool (v3) with zero downtime. As a result, I trailblazed the adoption of the Google-wide experimentation framework in the Monitoring & Alerting space, showcasing its safety. All in the space of a quarter (for context: 4 years prior, a similar effort lasting 6 months resulted in Google's internal users using both tools instead of the 15-year-old tool being fully deprecated).
Learned and wrote (almost 10k lines of) TypeScript, for the first time in production.

Containerization, Communication, Traffic Engineering, TypeScript, Problem Solving

Software Engineer, Site Reliability Engineering - Distributed Systems

Dec 2022 – May 2023 · 5 mos

Docs / Editors @ Google Workspace
Derisked observability concerns in a week via 30 code changes requiring approval from over 15 engineering leads and managers in partner dev teams. For context, this problem had gone unsolved for over 2 quarters before I joined the team. I fixed production notifications so that SRE isn't blind to customer pain.
Biggest challenges overcome: slow bureaucracy (by being clear and upfront with stakeholders about the changes being made), and learning just enough of an obsolete language to solve the issue end to end.

Containerization, Communication, Automated Alerts, Stakeholder Management, Problem Solving

Software Engineer Site Reliability Engineering - Distributed Systems

Apr 2021 – Present · 5 yrs 2 mos

Automated incident troubleshooting @ Google
Built and facilitated Google-wide adoption of an automated incident troubleshooting tool, impacting multiple product areas by reducing mean time to mitigation from hours (or even days) to minutes.

System Monitoring, Anomaly Detection, Automation, Project Management

Software Engineer, Site Reliability Engineering - Distributed Systems

Mar 2019 – Dec 2022 · 3 yrs 9 mos

Serverless @ Google Cloud Platform
Making Google Cloud's Serverless compute offerings (App Engine, Cloud Functions, Cloud Run) more reliable. Check out cloud.google.com/serverless (featured at Google I/O '22 - https://youtu.be/qBkyU1TJKDg?t=2556)
Projects
Automated stack turnups
Enabled safe, slow rollouts of changes to over 500 databases. This ended the status quo in which bad schema changes affected customers globally for multiple hours; no such issues occurred thereafter.
Set up operational excellence processes that helped the team and dev partner teams sustainably reduce ops load. The SRE team was able to onboard 40% more partner teams whilst continually reducing overall operational load and getting rid of unactionable alerts.
Built an automated incident root-causing system (used Google-wide). This reduced mean time to response and mitigation from multiple hours to just minutes.
Operations
Experienced on-caller who has handled numerous incidents and followed up with postmortems. I've led large-scale incident responses spanning numerous days (most notably the Log4j remote-code-execution vulnerability response).
Improved tooling across products to speed up debugging and impact analysis by automating toilsome tasks into scripts/notebooks, reducing work that took hours to results achieved in seconds.
Community
Mentored 10+ FTEs/interns into their roles.
Translated numerous math lessons for over 1M primary school children in developing countries for Oppia (check out oppia.org).

Containerization, Communication, Problem Solving, Data Analysis, Release Management

Booking.com

2 roles

Software Developer

Promoted

Jun 2018 – Feb 2019 · 8 mos

Worked on automating various parts of the machine learning platform.

Containerization, Communication, Problem Solving, Grafana

Graduate Software Developer

Jun 2017 – Jun 2018 · 1 yr

Involved in the infrastructure teams for reviews, images and machine learning.

Communication, Problem Solving

Adobe

Site Reliability Engineer Intern

Jun 2016 – Aug 2016 · 2 mos · Noida Area, India

Worked on Adobe Social query monitoring and auto-remediation of servers using SaltStack, along with various other scripts to automate processes on the Adobe Social and Analytics team.

Communication, Problem Solving

IIIT Hyderabad

Teaching Assistant

Jan 2016 – May 2016 · 4 mos

My responsibilities include conducting labs and tutorials, along with setting and testing problems for Data Structures assignments.

Communication, Problem Solving

HackerEarth

Problem Curator

Jul 2015 – Apr 2016 · 9 mos

My job entails testing problems and writing editorials for problems to be used in contests.

Communication, Problem Solving

Zopnow.com

SDE Intern

Jun 2015 – Jul 2015 · 1 mo · Bengaluru Area, India

My responsibilities include automating the process of planning trips so as to deliver groceries on time while covering the least possible distance, maximizing profits.

Communication, Problem Solving