Shashankh Subramani

CEO

Bengaluru, Karnataka, India · 5 yrs 7 mos experience

Highly Stable · AI Enabled

Key Highlights

Expert in AI-driven incident detection and prevention.
Proven track record in architecting scalable observability solutions.
Skilled in automation-driven reliability for cloud platforms.

Stackforce AI infers this person is a Site Reliability Engineer specializing in AI/ML operations within the SaaS industry.

Skills

Core Skills

Site Reliability Engineering, AI/ML Operations, Deployment Pipelines

Other Skills

AI/AIOps, Anomaly Detection, KPI Analytics, Prometheus, Neo4j, Argo Workflows, Terraform, Grafana, AI/ML, Data Collection, Model Deployment, Monitoring, Alerting, Ansible, CI/CD

About

Experienced Site Reliability Engineer specializing in AI/AIOps, large-scale cloud observability platforms, and automation-driven reliability. Proven track record in architecting AI/ML-powered incident detection and prevention pipelines, designing resilient and scalable metric/log processing systems, and delivering secure, automated deployment frameworks. Adept at reducing operational noise, enhancing root cause analysis with graph-based infrastructure mapping, and accelerating incident resolution through intelligent alerting and synthetic testing. Skilled in driving cross-functional delivery for critical infrastructure projects, from AI-assisted operational insights to multi-region observability solutions, ensuring uptime and performance for mission-critical services.

Experience

Total Experience: 5 yrs 7 mos

Average Tenure: 2 yrs 9 mos

Current Experience: 1 mo

Mila health inc

GenAI

May 2026 – Present · 1 mo

Cisco

3 roles

Site Reliability Engineer 2 | Building Future-Ready Observability with AI

Promoted

Sep 2022 – Apr 2026 · 3 yrs 7 mos

AI/AIOps
Built production-grade AI/ML operational pipelines — delivering end-to-end model lifecycle management (data collection, training, deployment, monitoring, retraining) for anomaly detection and KPI analytics, enabling proactive incident prevention and operational forecasting.
Deployed on-premises AI-focused metric storage solutions (Prometheus) integrated with Telegraf metric relays to support high-volume anomaly detection and time-series analysis at scale.
Architected component and infrastructure mapping in Neo4j, integrating data from VMs, hypervisors, network devices, and storage — enabling faster root cause analysis and dependency resolution.
Orchestrated and led AI-driven observability initiatives, including:
Anomaly detection pipeline with noise reduction and actionable alerting.
KPI analytics stack integrating ingestion, querying, and visualizations.
Integration of AI agents with Thanos (Prometheus HA) metric stores using MCP server extensions.
Observability/SRE
Automated incident handling and alert correlation by integrating Argo Workflows with PagerDuty and WebEx, enabling faster incident triage and reducing manual operational workload.
Scaled and optimized metrics pipelines through capacity benchmarking, load analysis, and performance tuning — enabling deployments to larger environments while improving throughput and stability.
Engineered end‑to‑end Elastic Cloud automation using Terraform and PrivateLink/Kibana access controls, creating secure, repeatable deployments across multiple environments.
Designed and deployed advanced observability dashboards in Grafana consolidating host metrics, process health, and system status — enabling Support teams to quickly assess service health.
Participated actively in on-call rotations, addressing production issues across infrastructure and application layers; contributed to feature development and long-term maintenance of internal tooling and services.

Site Reliability Engineer 1 | Building production grade deployment pipelines for monitoring tools

Aug 2021 – Aug 2022 · 1 yr

Built Observability as a Service by creating data pipelines for logs and metrics. Deployed and configured monitoring on both on-prem (OpenStack) and off-prem (AWS) infrastructure.
Tech: Ansible/Terraform, CI/CD, Docker, Kubernetes/Nomad, Telegraf, InfluxDB, Prometheus, Grafana, Kinesis/Kafka, ELK Stack

Technical Undergraduate Intern

Feb 2021 – Jul 2021 · 5 mos

Developed a tool for collecting and storing capacity and utilization metrics of the on-prem cloud.
Used a data analytics tool to create business intelligence reports and predict future trends in these metrics.
Software: Python, GitLab CI/CD, MariaDB Galera Cluster, Tableau

Face - forum for aspiring computer engineers

Executive

Jul 2019 – Jun 2020 · 11 mos · Bangalore

Healthedge

Intern - Big Data Engineering

May 2019 – Jun 2019 · 1 mo · Bengaluru Area, India

Worked on the team in charge of big data analytics and machine learning. Was tasked with hydrating the data lake by moving data from an operational data store into it. An inference engine was built on top of this data lake to provide predictions for chronic health conditions.