Shashankh Subramani

CEO

Bengaluru, Karnataka, India · 5 yrs 7 mos experience
Highly Stable · AI Enabled

Key Highlights

  • Expert in AI-driven incident detection and prevention.
  • Proven track record in architecting scalable observability solutions.
  • Skilled in automation-driven reliability for cloud platforms.
Stackforce AI infers this person is a Site Reliability Engineer specializing in AI/ML operations within the SaaS industry.

Skills

Core Skills

Site Reliability Engineering · AI/ML Operations · Deployment Pipelines

Other Skills

AI/AIOps · Anomaly Detection · KPI Analytics · Prometheus · Neo4j · Argo Workflows · Terraform · Grafana · AI/ML · Data Collection · Model Deployment · Monitoring · Alerting · Ansible · CI/CD

About

Experienced Site Reliability Engineer specializing in AI/AIOps, large-scale cloud observability platforms, and automation-driven reliability. Proven track record in architecting AI/ML-powered incident detection and prevention pipelines, designing resilient and scalable metric/log processing systems, and delivering secure, automated deployment frameworks. Adept at reducing operational noise, enhancing root cause analysis with graph-based infrastructure mapping, and accelerating incident resolution through intelligent alerting and synthetic testing. Skilled in driving cross-functional delivery for critical infrastructure projects, from AI-assisted operational insights to multi-region observability solutions, ensuring uptime and performance for mission-critical services.

Experience

Total Experience: 5 yrs 7 mos
Average Tenure: 2 yrs 9 mos
Current Experience: 1 mo

Mila health inc

GenAI

May 2026 - Present · 1 mo

Cisco

3 roles

Site Reliability Engineer 2 | Building Future-Ready Observability with AI

Promoted

Sep 2022 - Apr 2026 · 3 yrs 7 mos

  • AI/AIOps
      • Built production-grade AI/ML operational pipelines, delivering end-to-end model lifecycle management (data collection, training, deployment, monitoring, retraining) for anomaly detection and KPI analytics, enabling proactive incident prevention and operational forecasting.
      • Deployed on-premises AI-focused metric storage solutions (Prometheus) integrated with Telegraf metric relays to support high-volume anomaly detection and time-series analysis at scale.
      • Architected component and infrastructure mapping in Neo4j, integrating data from VMs, hypervisors, network devices, and storage to enable faster root cause analysis and dependency resolution.
      • Orchestrated and led AI-driven observability initiatives, including:
          • Anomaly detection pipeline with noise reduction and actionable alerting.
          • KPI analytics stack integrating ingestion, querying, and visualizations.
          • Integration of AI agents with Thanos (Prometheus HA) metric stores using MCP server extensions.
  • Observability/SRE
      • Automated incident handling and alert correlation by integrating Argo Workflows with PagerDuty and WebEx, enabling faster incident triage and reducing manual operational workload.
      • Scaled and optimized metrics pipelines through capacity benchmarking, load analysis, and performance tuning, enabling deployments to larger environments while improving throughput and stability.
      • Engineered end-to-end Elastic Cloud automation using Terraform and PrivateLink/Kibana access controls, creating secure, repeatable deployments across multiple environments.
      • Designed and deployed advanced observability dashboards in Grafana consolidating host metrics, process health, and system status, enabling Support teams to quickly assess service health.
      • Participated actively in on-call rotations, addressing production issues across infrastructure and application layers; contributed to feature development and long-term maintenance of internal tooling and services.

Site Reliability Engineer 1 | Building production-grade deployment pipelines for monitoring tools

Aug 2021 - Aug 2022 · 1 yr

  • Built Observability as a Service by creating data pipelines for logs and metrics. Deployed and configured monitoring on both on-prem (OpenStack) and off-prem (AWS) infrastructure.
  • Tech: Ansible/Terraform, CI/CD, Docker, Kubernetes/Nomad, Telegraf, InfluxDB, Prometheus, Grafana, Kinesis/Kafka, ELK Stack

Technical Undergraduate Intern

Feb 2021 - Jul 2021 · 5 mos

  • Developed a tool for collecting and storing capacity and utilization metrics of the on-prem cloud.
  • Used a data analytics tool to create business intelligence reports and predict future trends in these metrics.
  • Software: Python, GitLab CI/CD, MariaDB Galera Cluster, Tableau

Face - forum for aspiring computer engineers

Executive

Jul 2019 - Jun 2020 · 11 mos · Bangalore

Healthedge

Intern - Big Data Engineering

May 2019 - Jun 2019 · 1 mo · Bengaluru Area, India

  • Worked on the team in charge of big data analytics and machine learning. Tasked with hydrating the data lake by moving data from an operational data store into it. An inference engine built on this data lake provided predictions for chronic health conditions.

Education

The University of Manchester

Master of Science - MS — Advanced Computer Science (Artificial Intelligence)

Sep 2023 - Dec 2024

Amrita Vishwa Vidyapeetham

Bachelor of Technology — Computer Science and Engineering

Jun 2017 - Jul 2021

Clarence High School Bangalore

Jan 2005 - Jan 2015
