Sumit Chachadi

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 12 yrs 5 mos experience

Most Likely To Switch · AI ML Practitioner

Key Highlights

Proven incident commander with 50+ Sev-0/Sev-1 resolutions.
Architected AI-powered operations platforms reducing operational toil.
Achieved 300% QoQ growth in user engagement.

Stackforce AI infers this person is a Site Reliability Engineer specializing in AI-driven operations and cloud infrastructure.

Skills

Core Skills

Site Reliability Engineering, Cloud Infrastructure, AI Operations, Incident Management, Observability, Alert Management, Frontend Development, Data Engineering, NetDevOps, Network Automation, Test Automation, Software Development

Other Skills

Kubernetes, AWS, Python, Prometheus, Grafana, OpenTelemetry, FastAPI, Claude, MCP, API Development, React, TypeScript, GitHub, LLM, CI/CD

About

Senior Site Reliability Engineer with 8+ years of experience building and operating large-scale distributed systems at Airbnb and Cisco. Deep expertise in Python automation, observability infrastructure (Prometheus, Grafana, OpenTelemetry), and cloud-native platforms (Kubernetes, AWS). Currently building AIOps and LLM-powered operations platforms using Model Context Protocol (MCP), agentic AI workflows, and RAG-based automation to reduce operational toil and accelerate incident resolution. Proven incident commander with 50+ Sev-0/Sev-1 resolutions; experienced in defining SLOs/SLIs, managing error budgets, and driving high availability and fault tolerance across distributed services. Previously, I worked at Cisco, where I leveraged my Master’s degree to lead networking automation and NetDevOps initiatives.

Experience

Total Experience: 12 yrs 5 mos

Average Tenure: 3 yrs 1 mo

Current Experience: 3 yrs 6 mos

Airbnb

Site Reliability Engineer

Nov 2022 – Present · 3 yrs 6 mos · Greater Bengaluru Area

Senior SRE at Airbnb building reliability, observability, and AI-powered operations platforms.
Consecutively rated "Exceeds Airbnb's High Expectations" — top performance tier (2024, 2025).
Incident commander for 50+ Sev-0/Sev-1 incidents with p90 MTTR under 1 hour.
OPSBOT / MATIK — AI-Powered Operations Platform
Architected LLM-driven ops platform (Python, FastAPI, Claude, MCP) — 2,000 RPS, 9 MCP integrations, 25+ production services
Natural language incident triage, automated runbooks, and alert response via Slack — ~70% L1 on-call toil reduction
Designed for 99.9% availability; defined SLIs (P99 latency, error rates) and tracked 30-day error budgets
Grew to 1,000+ active users with 300% QoQ growth; primary on-call triage interface org-wide
MAESTRO — Observability Indexer
Auto-discovery indexer across 1,900+ Git repos for org-wide alerting/logging coverage and SLO visibility
99.8% API call reduction (10,000/hr to 120/hr), saving ~$50K/yr
CAWS — Centralized Alerting
Migrated 150+ legacy alerts; actionable alert ratio 11% to 55% (5x) across 10+ teams
Built Go scheduled downtime app for programmatic alert silencing with Prometheus metrics
SPOG — Single Pane of Glass
React/TypeScript frontend + 30+ backend APIs for unified visibility platform used by 200+ engineers
98% page load reduction; 90% faster org-tree fetching
MATIK — Reliability Data Platform
GitHub event catalog: 35,000+ PRs, 500+ repos for incident correlation and reliability reporting
92% LLM/API cost reduction (~$80K/yr); pipeline time 150min to 40min (73% faster)
INCIDENT COMMAND & ON-CALL
50+ Sev-0/Sev-1 incidents · p90 MTTR under 1 hour · 25+ blameless postmortems
Auto-discovery monitoring for 1,000+ targets; custom Prometheus exporters on Kubernetes with Helm

Kubernetes, AWS, Python, Prometheus, Grafana, OpenTelemetry +3

Cisco

3 roles

Lead Engineer - NetDevOps

Aug 2021 – Nov 2022 · 1 yr 3 mos

Oversee the end-to-end validation lifecycle for customers: designing the validation plan, developing and executing automation, analyzing reports, deploying new changes, and operating and maintaining the feature in a CI/CD model.
Design frameworks and tools for Cisco network automation libraries on an API-based architecture, with a focus on reusability, scalability, and portability. The automation libraries are vendor-neutral and support interfacing with REST APIs and vendor operating systems such as Junos, Huawei, and IOS.
Interface directly with customers to identify requirements, understand concerns, and propose tailored subscription or transactional offers that fit their needs. Identify resources with the right technical skill sets for project engagements.
Develop libraries for seamless integration of industry-standard test automation frameworks such as pyATS, Ansible, and Robot Framework. This allows network test automation to be integrated into customer product and software upgrade pipelines, moving it into a NetDevOps paradigm.
Implement agile practices: creating user stories and conducting planning sessions, retrospectives, and daily stand-up calls for smoother handling of change requests. Curate JIRA dashboards and sprints for tracking project metrics.

Network Automation, CI/CD, REST API, Ansible, pyATS, NetDevOps

Lead Automation Engineer

Promoted

Feb 2020 – Nov 2022 · 2 yrs 9 mos

Leading a mid-sized team to drive innovation in automation. The team has been successful in reducing the turnaround time for test automation by 35% and has increased the automation adoption among networking professionals by 20%.
Leading a team in adopting a crowdsourcing model to add extra throughput for test automation. The team has increased the automation KPI by 40%, and the initiative has been excellent for skill development across the team.
Developed Python scripts for collecting and charting automation metrics, giving a holistic view of the organisation's progress and the scope for improvement per fiscal quarter. The project was appreciated by senior management and has become part of their quarterly reporting.
Conducted trainings in Python, network automation, test framework integration, DevNet certifications, and automation best practices to uplift the skills of the broader organization. I strongly believe that team growth is as important as personal growth.
Developing networking test automation frameworks using Python, Robot Framework, and Ansible.
Automating the operation of Viptela SD-WAN and monitoring its performance. Testing feature deployments for end customers to ensure a bug-free production deployment.

Python, Robot Framework, Ansible, Test Automation

Software Engineer

May 2017 – Feb 2020 · 2 yrs 9 mos

Technologies: Python 3, pyATS, Cafy, TextFSM, Paramiko, Segment Routing, JIRA, Docker, Confluence
Software: Confluence, JIRA, Box, Smartsheets, IXIA, Spirent.
Agile: SCRUM, Sprint, Retrospectives
● Automated tests, policy execution, and validation for Cisco devices and topologies.
● Worked extensively with ASR9k, NCS5k, and XTC controllers.
● Developed a policy generator for scale profile testing, which helped the manual and automation test teams test up to 16k policies.
● Handled a complete project under Segment Routing, named Segment Routing Shortcuts, which gained recognition for its use of common scripts, greatly reducing development and debugging time.
● Gained valuable experience working with traffic generators (Spirent and IXIA). Handled API calls to both using their REST interfaces and their respective Python APIs.
● Extensive experience in efficiently using JIRA, Confluence, Cisco Test Manager 2, and Git.
● Developed Python APIs for various IOS commands and traffic generators.
● Used Google's TextFSM and Cisco's Cafy to create regex parsers that extract the required parameters from test devices.
● Learnt to work under agile methodology, following sprints with 2-week deadlines.
● Conducted and looked after daily stand-up calls for the sprints.
● Worked on testing cutting-edge development features in Cisco's Segment Routing.
● Led the team and researched the use of virtual devices for test automation with VIRL and NetSim. Used their REST APIs to automate topology creation and validation.
● Made meticulous use of Python 3 libraries and Cisco's in-house testing library, pyATS.

Python, JIRA, Docker, Software Development

University at Buffalo

Student Assistant

Sep 2015 – Apr 2017 · 1 yr 7 mos · Buffalo/Niagara, New York Area

IEEE-GIT

2 roles

Chairperson

May 2014 – May 2015 · 1 yr

● Mainly responsible for organizing and managing events in the college. Represented the college at the Sectional level.
● Led a team of 7 committee members and managed a group of 84 volunteers in organizing Paanchajanya – 2015.
● Worked in a team to raise funds worth $400 to meet the operating costs of Student Activities under IEEE.

Webmaster

Jul 2013 – May 2014 · 10 mos

● Maintained the server at IEEE GIT and updated the website with the activities of the student branch. Also handled all correspondence with the section.