Mariswaran R

SRE (Site Reliability Engineer)

Chennai, Tamil Nadu, India · 10 yrs 6 mos experience
AI ML Practitioner · AI Enabled

Key Highlights

  • 11+ years in Site Reliability Engineering and DevOps.
  • Expert in AI-driven operational automation and observability.
  • Proven track record in managing large-scale distributed systems.
Stackforce AI infers this person is a SaaS expert with a focus on Site Reliability Engineering and AI-driven operations.

Skills

Core Skills

Site Reliability Engineering · Platform Operations · AWS Cloud Infrastructure · AIOps · Observability Engineering · MCP Server Development

Other Skills

Server Architecture · AWS · Kubernetes · Incident Management · Observability · AI-assisted operational tooling · Anomaly detection · Telemetry correlation · MCP · OpenSearch · Grafana · New Relic · Batch Control · Bash · Problem Solving

About

DevOps/Site Reliability Engineer with 11+ years of experience building and operating large-scale distributed systems across cloud-native environments. I specialize in Reliability Engineering, Observability, Platform Operations, and AI-driven operational automation. My experience spans AWS, Kubernetes (EKS), OpenSearch, New Relic, Dynatrace, Grafana, OpenTelemetry, CI/CD, and large-scale telemetry platforms supporting 1,000+ microservices in production.

I focus on improving system reliability through observability, automation, and intelligent operational workflows. My work includes designing monitoring and alerting strategies, driving incident response and root cause analysis, implementing SLO-based reliability practices, and enabling engineering teams to make faster operational decisions.

Recently, I have been building MCP (Model Context Protocol) servers and AI-powered operational platforms that unify data across observability, cloud infrastructure, deployment pipelines, and production systems. These platforms give engineers a single interface to query operational data, investigate incidents, correlate telemetry, and accelerate troubleshooting using AI-assisted workflows.

I am also developing AIOps solutions using machine learning and LLMs for anomaly detection, telemetry correlation, incident summarization, and operational decision support, helping reduce time to detection and time to resolution during production events.
Core Expertise:

  • Site Reliability Engineering (SRE)
  • Observability Engineering (Metrics, Logs, Traces, APM)
  • MCP Server Development & AI Agent Integrations
  • AIOps, ML & LLM-powered Operations
  • Kubernetes Platform Operations (EKS)
  • AWS Cloud Infrastructure
  • OpenSearch, Dynatrace, New Relic, Grafana, and Catchpoint
  • Incident Management & Root Cause Analysis
  • SLOs, SLIs & Error Budget Governance
  • CI/CD & Production Release Engineering
  • Telemetry Pipelines & Monitoring Platforms

Currently exploring the future of AI-assisted operations by combining MCP, observability platforms, and operational intelligence to build faster, more reliable, and context-aware engineering workflows.

Open to Senior SRE, Staff SRE, Observability Engineering, Reliability Platform Engineering, and AI for Operations opportunities.

Experience

Total Experience: 10 yrs 6 mos
Average Tenure: 3 yrs 6 mos
Current Experience: 0 mo

Optum

Senior Platform Engineer

Jun 2026 - Present · 0 mo · Chennai · Hybrid

  • Platform Engineer | Site Reliability Engineering (SRE)
  • Responsible for managing and evolving cloud-native platform services, Kubernetes infrastructure, observability platforms, and deployment automation. Focused on improving platform reliability, scalability, operational efficiency, and developer experience through automation, monitoring, and modern platform engineering practices.
  • Key areas:
      • Kubernetes Platform Operations
      • Observability & Monitoring
      • Platform Automation
      • CI/CD & GitOps
      • Incident Management & Reliability Engineering
      • Infrastructure Modernization
Site Reliability Engineering · Server Architecture · Platform Operations

Verizon

2 roles

Engineer Consultant - Software Development

Promoted

Dec 2022 - May 2026 · 3 yrs 5 mos · Hybrid

  • Senior DevOps/SRE Engineer
  • Own end-to-end observability and reliability for large-scale distributed microservices running on AWS and Kubernetes, ensuring service health, early incident detection, and production risk visibility.
  • Build and operate production infrastructure on AWS (EKS, EC2, VPC, IAM, networking, storage) including environment provisioning, platform configuration, and secure service connectivity.
  • Design and manage external traffic architecture from internet to services using DNS, CDN, load balancers, NGINX ingress, and Kubernetes service routing.
  • Lead production operations and incident management, including monitoring, on-call response, triage, mitigation, and post-incident root cause analysis.
  • Define service health using SLO-driven signals such as latency, error rate, and traffic patterns to detect customer-impacting issues early.
  • Implement baselining and anomaly detection on key telemetry signals to identify abnormal behavior before incidents escalate.
  • Drive high-signal alerting strategies that reduce alert fatigue while improving Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
  • Provide real-time incident analysis and impact assessment during live production events to guide engineering teams toward faster stabilization.
  • Own deployment observability, correlating production behavior with CI/CD releases to support safe rollouts and rollback decisions.
  • Build AI-assisted operational tooling for anomaly detection, telemetry correlation, and incident summarization to improve operational efficiency.
  • Develop MCP-based interfaces for observability platforms (e.g., OpenSearch, New Relic, Grafana, Kubernetes), enabling AI systems and engineers to query logs and metrics, analyze production signals, and troubleshoot incidents faster.
AWS · Kubernetes · Incident Management · Observability · AI-assisted operational tooling · Site Reliability Engineering · +1

Engineer - Software Development

Nov 2018 - Nov 2022 · 4 yrs · Hybrid

  • Provided application and infrastructure support for multiple Verizon Enterprise applications.
  • Managed and optimized systems including Enterprise Prepaid Systems and Tokenization Gateway.
  • Collaborated with cross-functional teams to enhance the Reference Data Management System.
  • Contributed to the Service Management and Account Resource Tool, improving operational efficiency.
New Relic · Batch Control

Sears Holdings India

Technical Associate

Dec 2017 - Aug 2018 · 8 mos · Pune, Maharashtra, India · On-site

Batch Control · Bash

Tata Consultancy Services

2 roles

System Engineer

Apr 2016 - Dec 2017 · 1 yr 8 mos · On-site

  • Vodafone Unified Prepaid Support System
Bash · Problem Solving

Assistant System Engineer

Jun 2015 - Mar 2016 · 9 mos · On-site

  • Vodafone Unified Prepaid Support System
Bash · Problem Solving

Education

Anna University Chennai

Bachelor of Engineering (BE), Engineering

Aug 2011 - Apr 2015

State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)

Mathematics

Jun 2008 - Jun 2009

State Board School Examinations (Sec.) & Board of Higher Secondary Examinations, Tamil Nadu (TNSB)

Mathematics and Computer Science

Jun 2010 - Apr 2011
