Akarshi Kapoor — SRE (Site Reliability Engineer)

I’m a Lead Site Reliability/Software Engineer with 12+ years of experience building and operating large-scale distributed systems, observability platforms, and backend infrastructure. My work spans Cisco, Netflix, NTT Ltd., Bank of America, Accenture, and TCS, where I’ve led high-impact initiatives across Kubernetes, Kafka/MSK, Terraform, cloud infrastructure, observability, and automation. I specialize in designing reliable, scalable systems and helping teams execute complex roadmaps with high engineering standards.

At Cisco, I architected an event-driven AI pipeline on Kafka/MSK and Kubernetes that processes 2B+ streaming events per day with sub-second end-to-end latency. I also built a multi-tenant event streaming platform for a US Government project in just 30 days, owning infrastructure, networking, and security end to end. I’ve supported distributed systems operating at P99 latency below 200ms while also leading hiring and team development initiatives. At Netflix, I improved deployment performance by 30%, helped maintain 99.99% uptime during high-traffic conditions, and built AI-assisted reliability tooling that reduced MTTD by 60% and MTTR by 50%. Across earlier roles, I’ve delivered ~50% AWS cost savings, 35% performance improvements, and major reductions in operational toil through automation, cloud modernization, and CI/CD improvements.

I’m passionate about building resilient platforms, leading strong engineering teams, and solving hard problems in distributed systems, observability, event-driven architecture, and AI-enabled operations.

Stackforce AI infers this person is a highly skilled Site Reliability Engineer specializing in large-scale distributed systems and cloud infrastructure.

Location: Bengaluru, Karnataka, India

Experience: 11 yrs 10 mos

Skills

Site Reliability Engineering
Kubernetes
Apache Kafka
Infrastructure Management
Artificial Intelligence (AI)
DevOps
Cloud Computing

Career Highlights

Architected a billion-event processing pipeline.
Achieved 99.99% uptime during high-traffic conditions.
Delivered significant AWS cost savings through optimization.

Work Experience

Cisco

Lead Site Reliability Engineer (1 yr 8 mos)

Netflix

Senior Site Reliability Engineer (1 yr 10 mos)

NTT

Senior Software Engineer (1 yr 7 mos)

Bank of America

Senior Software Engineer (8 mos)

Accenture

Senior Software Engineer (2 yrs)

Tata Consultancy Services

Software Engineer (4 yrs 1 mo)

Education

B.Sc. Honors at Dayalbagh Educational Institute

MBA at Dayalbagh Educational Institute

Higher Secondary at St. Conrad’s Inter College

AI Enabled · AI ML Practitioner

Contact

kapoorakarshi@gmail.com LinkedIn

Skills

Core Skills

Site Reliability Engineering
Kubernetes
Apache Kafka
Infrastructure Management
Artificial Intelligence (AI)
DevOps
Cloud Computing

Other Skills

Python (Programming Language)
Systems Design
Splunk
Retrieval-Augmented Generation (RAG)
AIOps
MLOps
Kafka/MSK
LangChain
AWS MSK
EKS
pgvector
Semantic Search
GitOps
Containerization
Technical Research

Experience

Total Experience: 11 yrs 10 mos

Average Tenure: 2 yrs

Current Experience: 1 yr 8 mos

Cisco

Lead Site Reliability Engineer

Oct 2024 – Present · 1 yr 8 mos · India · Hybrid

Architected an event-driven AI pipeline using Kafka/MSK and LangChain, processing 2B+ streaming events per day across multi-region clusters, enabling real-time LLM-based anomaly detection with <200ms end-to-end latency and 99.99% pipeline availability.
Integrated OpenAI APIs with PostgreSQL (pgvector) to deliver context-aware LLM responses across petabyte-scale enterprise datasets, reducing hallucinations by 45% and improving retrieval precision by 60% for 100K+ daily queries.
Standardised Kubernetes deployments across 200+ microservices using Helm, implementing progressive delivery (canary + blue/green) and automated validation pipelines, reducing blast radius by 95% and increasing deployment success rate to 99.8%.
Architected and delivered a mission-critical, multi-tenant event streaming platform (AWS MSK, EKS) handling 1M+ events/sec peak throughput for a US Government system, successfully deployed within 30 days under strict compliance and uptime requirements.
Designed a RAG-based AI assistant leveraging semantic search (pgvector) over 10M+ indexed documents, enabling sub-second retrieval of incident runbooks, reducing manual triage time by 50%, and improving MTTR by 35%.
Operate and scale a distributed event-streaming platform processing 3–5 billion events daily, maintaining P99 latency under 200ms and 99.99% SLA adherence across globally distributed consumers.
Led performance optimization of AWS MSK and EKS clusters supporting multi-terabyte/hour ingestion, improving throughput by 2.5x while maintaining strict SLAs for latency-sensitive downstream systems.
Authored and implemented a Dual-Root Private Certificate Authority (PCA) architecture with active-active cross-region failover, supporting millions of secure service-to-service authentications per hour with zero downtime and eliminating all single points of failure in identity validation.

Python (Programming Language) · Kubernetes · Site Reliability Engineering · Apache Kafka · Systems Design · Splunk · +3

Netflix

Senior Site Reliability Engineer

Nov 2022 – Sep 2024 · 1 yr 10 mos · Australia · On-site

Led hiring initiatives and professional development for a team of SREs, fostering a culture of ownership and accountability while enforcing high engineering standards via Helm and automated testing.
Partnered with global product teams to align reliability goals for large-scale production services, reducing deployment latency by 30% across multi-cloud environments (AWS/Azure).
Optimized the reliability of AI-driven event streams by implementing backpressure handling and circuit breakers for OpenAI API calls, ensuring 99.99% system uptime during high-traffic bursts.
Defined the strategy for a multi-region monitoring stack, managing Prometheus and Grafana at scale to provide end-to-end visibility for business-critical platforms.
Directed incident response and RCA processes for mission-critical services, leveraging an AI-powered SRE assistant to reduce MTTD by 60% and MTTR by 50%.
Designed Azure Landing Zone foundations to standardize subscriptions, management groups, identity, networking, and policy controls for secure, scalable multi-environment cloud adoption.

GitOps · Containerization · Technical Research · System Architecture · System Performance · Queuing · +22

NTT

Senior Software Engineer

Apr 2021 – Nov 2022 · 1 yr 7 mos · Bangalore Urban, Karnataka, India

Built AWS monitoring pipelines using CloudTrail, Lambda, SNS, and S3 to improve operational visibility and reduce downtime.
Automated CI/CD pipelines (GitHub to production) with GitHub Actions and Ansible, reducing deployment time by 50% and increasing deployment frequency by 40%.
Optimized large-scale data workflows using GKE and BigQuery, achieving a 35% performance uplift while scaling to meet enterprise demand.
Improved Azure Kubernetes Service (AKS) reliability and security posture through workload isolation, secret management, observability integration, and autoscaling configurations for business-critical services.

Prometheus.io · Non-Functional Requirements · GitOps · Containerization · Technical Research · System Architecture · +37

Bank of America

Senior Software Engineer

Aug 2020 – Apr 2021 · 8 mos · India

Spearheaded the technical roadmap for observability and data infrastructure, leading a cross-functional team to migrate enterprise-scale workloads from DataDog to Splunk with 100% data fidelity.
Delivered ~50% AWS cost savings by architecting Graviton migrations and rightsizing MSK clusters, balancing high-throughput performance with fiscal responsibility.
Owned reliability of Azure-based services by defining SLIs, SLOs, alerting thresholds, and operational runbooks to improve availability and reduce incident response time.

Non-Functional Requirements · GitOps · Containerization · Technical Research · System Architecture · AWS CodeDeploy · +27

Accenture

Senior Software Engineer

Aug 2018 – Aug 2020 · 2 yrs · India

Engineered a certificate lifecycle automation tool, saving 120+ engineering hours annually.
Orchestrated AWS migration using DMS and SCT, improving uptime to 99.99%, boosting performance by 30%, and reducing costs by 25%.

Non-Functional Requirements · GitOps · Containerization · Technical Research · System Architecture · AWS CodeDeploy · +20

Tata Consultancy Services

Software Engineer

Jul 2014 – Aug 2018 · 4 yrs 1 mo

Developed Ansible automation for Linux server management, reducing manual workload by 95%.
Built scripts for dynamic cloud instance management, saving $330K annually and 100+ man-hours/year.

Site Reliability Engineering