Mariswaran R — SRE (Site Reliability Engineer)
DevOps/Site Reliability Engineer with 11+ years of experience building and operating large-scale distributed systems in cloud-native environments. I specialize in Reliability Engineering, Observability, Platform Operations, and AI-driven operational automation. My experience spans AWS, Kubernetes (EKS), OpenSearch, New Relic, Dynatrace, Grafana, OpenTelemetry, CI/CD, and large-scale telemetry platforms supporting 1000+ microservices in production.
I focus on improving system reliability through observability, automation, and intelligent operational workflows. My work includes designing monitoring and alerting strategies, driving incident response and root cause analysis, implementing SLO-based reliability practices, and enabling engineering teams to make faster operational decisions.
Recently, I have been building MCP (Model Context Protocol) servers and AI-powered operational platforms that unify data across observability, cloud infrastructure, deployment pipelines, and production systems. These platforms give engineers a single interface to query operational data, investigate incidents, correlate telemetry, and accelerate troubleshooting with AI-assisted workflows.
I am also developing AIOps solutions that use machine learning and LLMs for anomaly detection, telemetry correlation, incident summarization, and operational decision support, helping reduce time to detection and time to resolution during production events.
Core Expertise:
- Site Reliability Engineering (SRE)
- Observability Engineering (Metrics, Logs, Traces, APM)
- MCP Server Development & AI Agent Integrations
- AIOps, ML & LLM-powered Operations
- Kubernetes Platform Operations (EKS)
- AWS Cloud Infrastructure
- OpenSearch, Dynatrace, New Relic, Grafana, and Catchpoint
- Incident Management & Root Cause Analysis
- SLOs, SLIs & Error Budget Governance
- CI/CD & Production Release Engineering
- Telemetry Pipelines & Monitoring Platforms
Currently exploring the future of AI-assisted operations by combining MCP, observability platforms, and operational intelligence to build faster, more reliable, and context-aware engineering workflows.
Open to Senior SRE, Staff SRE, Observability Engineering, Reliability Platform Engineering, and AI for Operations opportunities.
Stackforce AI infers this person is a SaaS expert with a focus on Site Reliability Engineering and AI-driven operations.
Location: Chennai, Tamil Nadu, India
Experience: 10 yrs 6 mos
Skills
- Site Reliability Engineering
- Platform Operations
- AWS Cloud Infrastructure
- AIOps
- Observability Engineering
- MCP Server Development
Career Highlights
- 11+ years in Site Reliability Engineering and DevOps.
- Expert in AI-driven operational automation and observability.
- Proven track record in managing large-scale distributed systems.
Work Experience
Optum
Senior Platform Engineer (0 mo)
Verizon
Engineer Consultant - Software Development (3 yrs 5 mos)
Engineer - Software Development (4 yrs)
Sears Holdings India
Technical Associate (8 mos)
Tata Consultancy Services
System Engineer (1 yr 8 mos)
Assistant System Engineer (9 mos)
Education
Bachelor of Engineering - BE at Anna University Chennai
Mathematics at State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)
Mathematics and Computer Science at State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)