Atul Aggarwal

DevOps Engineer

Bengaluru, Karnataka, India · 21 yrs 4 mos experience

Key Highlights

Over 21 years of experience in managing scalable distributed systems.
Proven track record in team initiation and bootstrapping.
Expertise in infrastructure observability and engineering management.

Stackforce AI infers this person is a seasoned Infrastructure and DevOps leader in the SaaS industry.

Skills

Core Skills

DevOps · Infrastructure · Blockchain · Resiliency Engineering · Site Reliability Engineering

Other Skills

Agile Methodologies · Amazon EKS · Amazon Web Services (AWS) · Apache · Apache Mesos · Backstage · CI&CD · CI/CD · Chaos Engineering · Chef · Cloud Computing · Compliance · DNS · Distributed Systems · Docker

About

Over 21 years of hands-on experience overseeing and managing highly scalable distributed systems, with infrastructure observability as a top priority. Proven track record of initiating and bootstrapping teams multiple times. Demonstrated leadership in guiding geographically dispersed teams in large-scale infrastructure, production engineering, DevOps, and operations, fostering close collaboration with product, operations, and business leadership teams. Extensive experience in production readiness, product strategy, and engineering management and leadership. Tech Stack: AWS foundational services including EKS, Lambda, Object Storage (S3), RDS, and Global Accelerator; Python, Terraform, Jenkins, GHA, Kong, Karpenter, Prometheus & Grafana, Datadog, New Relic, Coralogix, Chaos Toolkit.

Experience

Total Experience: 21 yrs 4 mos

Average Tenure: 2 yrs 6 mos

Current Experience: 8 mos

MHP – A Porsche Company

Senior Engineering Manager - DevOps

Sep 2025 – Present · 8 mos

Acko

Director of DevOps

Nov 2024 – Jul 2025 · 8 mos · Bengaluru, Karnataka, India · On-site

Led the Acko infrastructure team, using core components such as EKS and Karpenter to build a robust, cost-optimized platform.
Drove significant improvements in system uptime and incident response while reducing costs and accelerating deployment cycles.
Drove a strategic shift towards self-service deployment and provisioning workflows.
Developed an automation strategy using Backstage to enable developers to provision and deploy to staging environments autonomously, while maintaining strict control and governance through a system of guardrails.
Collaborated with the internal compliance team to execute IRDAI infrastructure compliance audits, successfully meeting all regulatory requirements.
Introduced a culture of continuous improvement through proactive game-day planning and execution. Identified and addressed infrastructure and deployment maturity gaps, including runbooks and monitoring.
Mentored and upskilled a lean infrastructure team, setting a clear roadmap for long-term goals including service mesh implementation, a GHA migration plan, and cost observability.

Python (Programming Language) · Amazon EKS · Terraform · Service Mesh · Jenkins · Backstage · +7

Coinbase

Engineering Manager - Infrastructure

Apr 2022 – Nov 2024 · 2 yrs 7 mos · Bengaluru, Karnataka, India · Remote

Led a multi-pod team of infrastructure engineers, overseeing the design, scaling, and management of the core platform, with a focus on developer productivity and regulatory compliance.
Drove a critical initiative to enhance blockchain node monitoring, developing the first draft of an automated system to detect and alert on network transaction sync delays, ensuring platform stability for self-hosted crypto assets. Used the Terraform Datadog provider for dashboard and alerting setup.
Collaborated with engineering and compliance teams to lead quarterly regulatory exercises, ensuring the infrastructure met all of the company's strict compliance standards.
Partnered with engineering stakeholders to drive global impact, defining key needs and building observability and automation solutions that improved both platform reliability and customer experience, including contractor engagements.

Infrastructure · Resiliency · Amazon Web Services (AWS) · DevOps · Docker Products · Microsoft Azure · +8

Atlassian

Engineering Manager - Platform SRE

Feb 2020 – Mar 2022 · 2 yrs 1 mo · Bengaluru, Karnataka, India

Led and scaled a 12-member software engineering team focused on building a resilience framework for the core platform, leading to a more stable and positive customer experience.
Authored the foundational design for the new resilience framework, conducting in-depth research and a build-vs-buy analysis of open source tools such as Chaos Toolkit and Litmus. This strategic work led to the development of an automated PaaS offering, enabling the team to run continuous resilience testing across all critical customer-facing services in a pre-production setup.
Drove resilience adoption under the banner "Everyday Resilience Engineering", transforming reactive incident response into proactive, continuous improvement efforts to enhance platform stability.

Resiliency · Amazon Web Services (AWS) · Docker Products · Chaos Engineering · Terraform · Continuous Verification · +5

Ola (ANI Technologies Pvt. Ltd.)

Senior Engineering Manager - SRE

Feb 2017 – Feb 2020 · 3 yrs

Led the infrastructure team overseeing the design, scaling, and management of the core platform, including Mesos/Marathon orchestration, HAProxy, and Kong.
Managed and mentored a 20+ member SRE team, including direct reports and staff engineers, with direct accountability for platform reliability, incident response, and change management in high-scale environments.
Pioneered the SRE function for new product launches, ensuring operational readiness, high uptime, and scalable deployment pipelines.
Owned the observability initiative, applying the USE and RED models to build a robust system for infrastructure insights. Implemented advanced monitoring, alerting, and logging solutions to fortify the production environment.
Drove SRE strategy for a major cloud migration from AWS to Azure, overseeing the planning, execution, and validation of the migration to ensure seamless service continuity and operational excellence.
Drove a culture of continuous improvement, reducing incident response times by enhancing detection and response playbooks for highly critical environments using the StackStorm automation framework.

Docker Products · Python (Programming Language) · Site Reliability Engineering · Infrastructure

3 roles

Manager, SRE

Mar 2014 – Feb 2017 · 2 yrs 11 mos · Bangalore

Initiated and led a team of 7 engineers.
As part of the SRE group, managed a highly scalable data system, overseeing petabytes of data, thousands of ETL jobs, and data movement across clusters, including cross-colo and data ingestion pipelines.
Collaborated closely with core development, product, and business teams to tailor solutions on the infrastructure and operations side. Delivered critical SLA reports to executives, drawing on extensive Hadoop and data warehouse skills.
Tackled challenges in designing and running systems at very large scale, handling petabyte-scale data and billions of events per day across multiple data centers.

Python (Programming Language)

Sr. Datawarehouse Operations Engineer

Sep 2012 – Mar 2014 · 1 yr 6 mos · Bangalore

Python (Programming Language)

Datawarehouse Operations Engineer

Sep 2011 – Sep 2012 · 1 yr · Bangalore

Python (Programming Language)

Yahoo

Senior Systems Engineer

Nov 2008 – Sep 2011 · 2 yrs 10 mos · Bangalore

Environment: Linux/FreeBSD/Perl
Role:
Working on ETL (extract, transform, and load) tools and responsible for the smooth functioning of all components (Warehouse, Scheduler {Moab}).
Data Highway (real-time data collection of web events): the Data Highway platform enables real-time data collection and processing within Yahoo. The largest Data Highway deployment spans tens of thousands of nodes over 20+ data centers and carries over 12 terabytes of data per day.
Working closely with system analysts to identify areas that require automation.
Building monitoring tools for proactive maintenance.
Coordinating with developers and raising enhancement requests wherever required.
Tracking bugs in Bugzilla and coordinating with the development team to fix them.
Taking part in capacity planning activities.
Working knowledge of data mining and data pipeline techniques.

HCL Technologies

Lead Engineer

Sep 2006 – Oct 2008 · 2 yrs 1 mo

Responsibilities:
Troubleshooting at level 2
Building monitoring tools for proactive maintenance
Handling support tasks and ensuring minimum response time
PROJECT DETAILS
Project: Onsite Engineer (System Administrator)
Environment: Linux/UNIX/Perl
Role:
Responsible for managing AAA, LDAP, DNS, and mail servers.
Troubleshooting and debugging issues at level 2 when onsite engineers could not resolve them.
Project: IOBAS (Inter Operator Billing and Accounting System) for BSNL
Environment: Linux/UNIX/Perl
Role:
Building automation for IOBAS processes.
Project: Comverse-Service Platform
Environment: Linux/UNIX/Perl
Role:
Analyzing client requirements and developing scripts for proactive maintenance of their products.

Elitecore Technologies Pvt. Ltd.

Network and System Engineer

Sep 2004 – Sep 2006 · 2 yrs

Responsibilities:
Providing backend operations and ensuring the smooth functioning of over 750 installations of Elitecore's products (Cyberoam, 24online).
Collecting and analyzing information about each client's network for smooth product deployment, with minimum downtime and minimum changes to the network.
Troubleshooting and solving system and network problems (such as routing and NAT issues at the router end and internal connectivity from the server towards the LAN) to ensure smooth operations after installation.
Responsible for managing the entire national and international 24online technical support.
Handling support tickets and ensuring minimum response time and client satisfaction.
PROJECT DETAILS
Project: Implementation of Cyberoam & 24online
Cyberoam is Unified Threat Management software deployed largely in corporate and educational institutions. 24online acts as an access gateway as well as a RADIUS server, and is a complete billing and bandwidth management solution for broadband, dial-up, and Wi-Fi service providers, hotels, hotspots, and cafes.
Environment: Linux
Role:
On the front end, collecting and analyzing information about each client's network and deploying the product at various organizations with minimum changes and downtime. On the back end, providing support for the applications running with Cyberoam, such as MySQL, iptables, and qmail, at all stages of the project.
Implementing 24online in different scenarios at various organizations, on wireless and wired networks, enabling NAS and PPPoE client authentication configurations and answering client queries about user management, package management, and related topics.