Mariswaran R — SRE (Site Reliability Engineer)

DevOps/Site Reliability Engineer with 11+ years of experience building and operating large-scale distributed systems in cloud-native environments. I specialize in reliability engineering, observability, platform operations, and AI-driven operational automation. My experience spans AWS, Kubernetes (EKS), OpenSearch, New Relic, Dynatrace, Grafana, OpenTelemetry, CI/CD, and large-scale telemetry platforms supporting 1,000+ microservices in production.

I focus on improving system reliability through observability, automation, and intelligent operational workflows. My work includes designing monitoring and alerting strategies, driving incident response and root cause analysis, implementing SLO-based reliability practices, and enabling engineering teams to make faster operational decisions.

Recently, I have been building MCP (Model Context Protocol) servers and AI-powered operational platforms that unify data across observability, cloud infrastructure, deployment pipelines, and production systems. These platforms give engineers a single interface to query operational data, investigate incidents, correlate telemetry, and accelerate troubleshooting using AI-assisted workflows.

I am also developing AIOps solutions using machine learning and LLMs for anomaly detection, telemetry correlation, incident summarization, and operational decision support, helping reduce time to detection and time to resolution during production events.

Core Expertise:
• Site Reliability Engineering (SRE)
• Observability Engineering (Metrics, Logs, Traces, APM)
• MCP Server Development & AI Agent Integrations
• AIOps, ML & LLM-powered Operations
• Kubernetes Platform Operations (EKS)
• AWS Cloud Infrastructure
• OpenSearch, Dynatrace, New Relic, Grafana, and Catchpoint
• Incident Management & Root Cause Analysis
• SLOs, SLIs & Error Budget Governance
• CI/CD & Production Release Engineering
• Telemetry Pipelines & Monitoring Platforms

Currently exploring the future of AI-assisted operations by combining MCP, observability platforms, and operational intelligence to build faster, more reliable, and context-aware engineering workflows.

Open to Senior SRE, Staff SRE, Observability Engineering, Reliability Platform Engineering, and AI for Operations opportunities.

Stackforce AI infers this person is a SaaS expert with a focus on Site Reliability Engineering and AI-driven operations.

Location: Chennai, Tamil Nadu, India

Experience: 10 yrs 6 mos

Skills

Site Reliability Engineering
Platform Operations
AWS Cloud Infrastructure
AIOps
Observability Engineering
MCP Server Development

Career Highlights

11+ years in Site Reliability Engineering and DevOps.
Expert in AI-driven operational automation and observability.
Proven track record in managing large-scale distributed systems.

Work Experience

Optum

Senior Platform Engineer (0 mo)

Verizon

Engineer Consultant - Software Development (3 yrs 5 mos)

Engineer - Software Development (4 yrs)

Sears Holdings India

Technical Associate (8 mos)

Tata Consultancy Services

System Engineer (1 yr 8 mos)

Assistant System Engineer (9 mos)

Education

Bachelor of Engineering - BE at Anna University Chennai

Mathematics at State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)

Mathematics and Computer Science at State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)

AI ML Practitioner · AI Enabled

Other Skills

Server Architecture
AWS
Kubernetes
Incident Management
Observability
AI-assisted operational tooling
Anomaly detection
Telemetry correlation
MCP
OpenSearch
Grafana
New Relic
Batch Control
Bash
Problem Solving

Experience

Total Experience: 10 yrs 6 mos
Average Tenure: 3 yrs 6 mos
Current Experience: 0 mo

Optum

Senior Platform Engineer

Jun 2026 – Present · 0 mo · Chennai · Hybrid

Platform Engineer | Site Reliability Engineering (SRE)
Responsible for managing and evolving cloud-native platform services, Kubernetes infrastructure, observability platforms, and deployment automation. Focused on improving platform reliability, scalability, operational efficiency, and developer experience through automation, monitoring, and modern platform engineering practices.
Key areas:
Kubernetes Platform Operations
Observability & Monitoring
Platform Automation
CI/CD & GitOps
Incident Management & Reliability Engineering
Infrastructure Modernization

Site Reliability Engineering · Server Architecture · Platform Operations

Verizon

2 roles

Engineer Consultant - Software Development

Promoted

Dec 2022 – May 2026 · 3 yrs 5 mos · Hybrid

Senior DevOps/SRE Engineer
Own end-to-end observability and reliability for large-scale distributed microservices running on AWS and Kubernetes, ensuring service health, early incident detection, and production risk visibility.
Build and operate production infrastructure on AWS (EKS, EC2, VPC, IAM, networking, storage) including environment provisioning, platform configuration, and secure service connectivity.
Design and manage external traffic architecture from internet to services using DNS, CDN, load balancers, NGINX ingress, and Kubernetes service routing.
Lead production operations and incident management, including monitoring, on-call response, triage, mitigation, and post-incident root cause analysis.
Define service health using SLO-driven signals such as latency, error rate, and traffic patterns to detect customer-impacting issues early.
Implement baselining and anomaly detection on key telemetry signals to identify abnormal behavior before incidents escalate.
Drive high-signal alerting strategies that reduce alert fatigue while improving Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Provide real-time incident analysis and impact assessment during live production events to guide engineering teams toward faster stabilization.
Own deployment observability, correlating production behavior with CI/CD releases to support safe rollouts and rollback decisions.
Build AI-assisted operational tooling for anomaly detection, telemetry correlation, and incident summarization to improve operational efficiency.
Develop MCP-based interfaces for observability platforms (e.g., OpenSearch, New Relic, Grafana, Kubernetes), enabling AI systems and engineers to query logs and metrics, analyze production signals, and troubleshoot incidents faster.

AWS · Kubernetes · Incident Management · Observability · AI-assisted operational tooling · Site Reliability Engineering

Engineer - Software Development

Nov 2018 – Nov 2022 · 4 yrs · Hybrid

Provided application and infrastructure support for multiple Verizon Enterprise applications.
Managed and optimized systems including Enterprise Prepaid Systems and Tokenization Gateway.
Collaborated with cross-functional teams to enhance the Reference Data Management System.
Contributed to the Service Management and Account Resource Tool, improving operational efficiency.

New Relic · Batch Control