Sumit Chachadi

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 12 yrs 5 mos experience
Most Likely To Switch · AI ML Practitioner

Key Highlights

  • Proven incident commander with 50+ Sev-0/Sev-1 resolutions.
  • Architected AI-powered operations platforms reducing operational toil.
  • Achieved 300% QoQ growth in user engagement.
Stackforce AI infers this person is a Site Reliability Engineer specializing in AI-driven operations and cloud infrastructure.

Skills

Core Skills

Site Reliability Engineering · Cloud Infrastructure · AI Operations · Incident Management · Observability · Alert Management · Frontend Development · Data Engineering · NetDevOps · Network Automation · Test Automation · Software Development

Other Skills

Kubernetes · AWS · Python · Prometheus · Grafana · OpenTelemetry · FastAPI · Claude · MCP · API Development · React · TypeScript · GitHub · LLM · CI/CD

About

Senior Site Reliability Engineer with 8+ years of experience building and operating large-scale distributed systems at Airbnb and Cisco. Deep expertise in Python automation, observability infrastructure (Prometheus, Grafana, OpenTelemetry), and cloud-native platforms (Kubernetes, AWS). Currently building AIOps and LLM-powered operations platforms using Model Context Protocol (MCP), agentic AI workflows, and RAG-based automation to reduce operational toil and accelerate incident resolution. Proven incident commander with 50+ Sev-0/Sev-1 resolutions; experienced in defining SLOs/SLIs, managing error budgets, and driving high availability and fault tolerance across distributed services. Previously, I worked at Cisco, where I leveraged my Master’s degree to lead networking automation and NetDevOps initiatives.

Experience

12 yrs 5 mos
Total Experience
3 yrs 1 mo
Average Tenure
3 yrs 6 mos
Current Experience

Airbnb

Site Reliability Engineer

Nov 2022 - Present · 3 yrs 6 mos · Greater Bengaluru Area

  • Senior SRE at Airbnb building reliability, observability, and AI-powered operations platforms.
  • Consecutively rated "Exceeds Airbnb's High Expectations" — top performance tier (2024, 2025).
  • Incident commander for 50+ Sev-0/Sev-1 incidents with p90 MTTR under 1 hour.
  • OPSBOT / MATIK — AI-Powered Operations Platform
  • Architected LLM-driven ops platform (Python, FastAPI, Claude, MCP) — 2,000 RPS, 9 MCP integrations, 25+ production services
  • Natural language incident triage, automated runbooks, and alert response via Slack — ~70% L1 on-call toil reduction
  • Designed for 99.9% availability; defined SLIs (P99 latency, error rates) and tracked 30-day error budgets
  • Grew to 1,000+ active users with 300% QoQ growth; primary on-call triage interface org-wide
  • MAESTRO — Observability Indexer
  • Auto-discovery indexer across 1,900+ Git repos for org-wide alerting/logging coverage and SLO visibility
  • 99.8% API call reduction (10,000/hr to 120/hr), saving ~$50K/yr
  • CAWS — Centralized Alerting
  • Migrated 150+ legacy alerts; actionable alert ratio 11% to 55% (5x) across 10+ teams
  • Built Go scheduled downtime app for programmatic alert silencing with Prometheus metrics
  • SPOG — Single Pane of Glass
  • React/TypeScript frontend + 30+ backend APIs for unified visibility platform used by 200+ engineers
  • 98% page load reduction; 90% faster org-tree fetching
  • MATIK — Reliability Data Platform
  • GitHub event catalog: 35,000+ PRs, 500+ repos for incident correlation and reliability reporting
  • 92% LLM/API cost reduction (~$80K/yr); pipeline time 150min to 40min (73% faster)
  • INCIDENT COMMAND & ON-CALL
  • 50+ Sev-0/Sev-1 incidents · p90 MTTR under 1 hour · 25+ blameless postmortems
  • Auto-discovery monitoring for 1,000+ targets; custom Prometheus exporters on Kubernetes with Helm
Kubernetes · AWS · Python · Prometheus · Grafana · OpenTelemetry · +3

Cisco

3 roles

Lead Engineer - NetDevOps

Aug 2021 - Nov 2022 · 1 yr 3 mos

  • Oversee the end-to-end validation lifecycle for customers: designing the validation plan, developing and executing automation, analyzing reports, deploying new changes, and operating and maintaining the feature in a CI/CD model.
  • Design frameworks and tools for Cisco network automation libraries on an API-based architecture, with a focus on reusability, scalability, and portability. The automation libraries are vendor-neutral and support interfacing with REST APIs and vendor operating systems such as Junos, Huawei, and IOS.
  • Interface directly with customers to identify requirements, understand concerns, and propose tailored subscription/transactional offers that fit their needs. Identify resources with the right technical skill sets for project engagements.
  • Develop libraries for seamless integration of industry-standard test automation frameworks such as pyATS, Ansible, and Robot Framework. This allows network test automation to be integrated into customer product and software upgrade pipelines and moved into a NetDevOps paradigm.
  • Implement agile practices: create user stories and conduct planning sessions, retrospectives, and daily stand-up calls for smoother transitions of change requests. Curate JIRA dashboards and sprints to track project metrics.
Network Automation · CI/CD · REST API · Ansible · pyATS · NetDevOps

Lead Automation Engineer

Promoted

Feb 2020 - Nov 2022 · 2 yrs 9 mos

  • Leading a mid-sized team to drive innovation in automation. The team reduced the turnaround time for test automation by 35% and increased automation adoption among networking professionals by 20%.
  • Leading a team to adopt a crowdsourcing model that adds throughput for test automation. The team increased its automation KPI by 40%, and the initiative has been excellent for skill development across the team.
  • Developed Python scripts for collecting and charting automation metrics, giving a holistic view of the organisation's progress and scope for improvement per fiscal quarter. The project was recognised by senior management and has become part of their quarterly reporting.
  • Conducted trainings in Python, network automation, test framework integration, DevNet certifications, and automation best practices to uplift the skills of the broader organization. I strongly believe that team growth is as important as personal growth.
  • Developing networking test automation frameworks using Python, Robot Framework, and Ansible.
  • Automating the operation of Viptela SD-WAN and monitoring its performance. Testing feature deployments for end customers to ensure a bug-free production deployment.
Python · Robot Framework · Ansible · Test Automation

Software Engineer

May 2017 - Feb 2020 · 2 yrs 9 mos

  • Technologies: Python 3, pyATS, Cafy, TextFSM, Paramiko, Segment Routing, JIRA, Docker, Confluence
  • Software: Confluence, JIRA, Box, Smartsheet, IXIA, Spirent
  • Agile: Scrum, sprints, retrospectives
  • Automated tests and policy execution and validation for Cisco devices and topologies.
  • Worked extensively on ASR9k, NCS5k, and XTC controllers.
  • Developed a policy generator for scale profile testing, which helped the manual and automation test teams test up to 16k policies.
  • Handled a complete Segment Routing project, Segment Routing Shortcuts, which gained recognition for its use of common scripts, greatly reducing development and debugging time.
  • Gained valuable experience working with the Spirent and IXIA traffic generators. Handled API calls to both using their REST interfaces and their respective Python APIs.
  • Extensive experience using JIRA, Confluence, Cisco Test Manager 2, and Git.
  • Developed Python APIs for various IOS commands and traffic generators.
  • Used Google's TextFSM and Cisco's Cafy to create regex parsers that extract required parameters from the test devices.
  • Learned to work in an agile methodology, following sprints with two-week deadlines.
  • Conducted and looked after daily stand-up calls for the sprints.
  • Tested cutting-edge development features in Cisco's Segment Routing.
  • Led the team and researched the use of virtual devices for test automation with VIRL and NetSim. Used their REST APIs to automate topology creation and validation.
  • Made meticulous use of Python 3 libraries and Cisco's in-house testing library, pyATS.
Python · JIRA · Docker · Software Development

University at Buffalo

Student Assistant

Sep 2015 - Apr 2017 · 1 yr 7 mos · Buffalo/Niagara, New York Area

IEEE GIT

2 roles

Chairperson

May 2014 - May 2015 · 1 yr

  • Organized and managed college events. Represented the college at the Section level.
  • Led a team of 7 committee members and managed a group of 84 volunteers to organize Paanchajanya – 2015.
  • Worked in a team to raise $400 in funds to cover the operating costs of Student Activities under IEEE.

Webmaster

Jul 2013 - May 2014 · 10 mos

  • Maintained the server at IEEE GIT and updated the website with the student branch's activities. Also handled all correspondence with the section.

Education

University at Buffalo

Master’s Degree — Computer Systems Networking and Telecommunications

Jan 2015 - Jan 2017

Gogte Institute of Technology

Bachelor’s Degree

Jan 2011 - Jun 2015

Shri Satya Sai Loka Seva P.U. College

Associate's Degree — Science

Jan 2010 - Jan 2011
