Akashdeep Thakur

DevOps Engineer

Bengaluru, Karnataka, India · 14 yrs 3 mos experience

Highly Stable

Key Highlights

Led SRE initiatives for multi-region infrastructure.
Achieved 30% cost savings through optimization.
Built high-performing teams focused on reliability.

Stackforce AI infers this person is a DevOps and Site Reliability Engineering expert in SaaS.

Skills

Core Skills

Site Reliability Engineering, Cloud Architecture, Automation, Incident Response, Monitoring, Cost Governance, CI/CD Automation, Compliance, Infrastructure as Code, DevOps Automation

Other Skills

API Development, AWS, AWS EKS, ActiveMQ, Agile Methodologies, Amazon Web Services (AWS), Ansible, ArgoCD, Artifactory, Azure, Bash, Capacity Planning, Chef, Continuous Integration, Continuous Deployment

About

I’m a DevOps and Site Reliability Engineering leader passionate about building platforms that combine resilience, scalability, and simplicity. At Adobe, I lead initiatives that modernize large-scale, multi-region infrastructure, from EKS-based microservices and Kafka streaming to observability, automation, and cost optimization, helping teams ship faster and operate smarter.

Over the years, I’ve built and mentored high-performing teams that embrace ownership and engineering excellence. My focus is on designing systems that self-heal, self-measure, and continuously improve, ensuring reliability becomes a shared mindset across development, QA, and operations.

Core strengths: Cloud architecture (AWS EKS, EC2, RDS, MSK, ElastiCache), Kubernetes & Helm, Terraform & Terragrunt, CI/CD Automation, Monitoring & Observability (Prometheus, Grafana, New Relic, Splunk), Cost Governance.

Leadership focus: Mentorship, Team Enablement, Process Automation, Cross-Functional Collaboration, Continuous Improvement.

I believe great infrastructure quietly enables great products, and great teams turn reliability into culture, not just metrics.

Experience

Adobe

Software Development Engineer

Sep 2024 – Present · 1 yr 6 mos · Noida, Uttar Pradesh, India · Hybrid

Leading reliability, observability, and automation strategy for Adobe Connect across three AWS regions (NA, EMEA, APAC), ensuring high availability and predictable scaling for 200+ services.
Defined SRE standards and reliability KPIs to align engineering and operations around shared SLIs/SLOs.
Designed and rolled out a unified GitOps and Helm-based deployment model, improving release consistency by 40%.
Built and mentored a DevOps team focused on Kubernetes, observability, and CI/CD automation.
Architected a Prometheus–Thanos–Grafana stack and migrated legacy Nagios monitoring to metrics-based alerting.
Delivered AWS cost savings of 30% through workload right-sizing and optimization.
Partnered with engineering leadership to strengthen on-call readiness and post-incident review culture.
Key Projects & Tools (2023–2025)
AI-Driven Log Analysis (Ollama + LLMs): Built an internal AI-based log analysis engine using Ollama to summarize large-scale service logs, detect anomalies, and provide incident RCA insights via Slack.
Incident Management Bot: Developed a Slack-based assistant integrating Prometheus and New Relic alerts with Jira automation for faster triage.
Prometheus Exporter for Analytics Platform: Designed Python-based custom exporters for Connect Analytics metrics, enabling unified observability across EKS clusters.
AWS Cost Optimization Toolkit: Automated underutilized resource detection (EC2, EBS, ElastiCache), achieving ~30% infrastructure cost savings.
GitOps Bootstrap Automation: Implemented ArgoCD “app-of-apps” pattern with layered Helm values and Secrets Manager integration for multi-region deployments.
Global Observability Platform (Prometheus + Thanos): Architected federated metrics ingestion from regional Prometheus instances into Thanos for global analytics.
SOC2 Compliance Auditor: Built a policy-drift and configuration scanner that validates IAM, S3, and VPC rules, automating SOC2 audit readiness.

AWS EKS, Kubernetes, Helm, Terraform, CI/CD Automation, Monitoring, +4 more

Paytm

Principal Engineer

Apr 2023 – Sep 2024 · 1 yr 5 mos · Hybrid

Directed the DevOps and reliability function for Paytm’s fintech infrastructure, aligning engineering execution with regulatory, performance, and availability goals.
Led a 5-member SRE team responsible for infrastructure scaling, Kubernetes upgrades, and multi-region resiliency.
Improved uptime and reduced incident frequency by 25% through structured release governance and observability adoption.
Implemented Infrastructure as Code (Terraform, ArgoCD), increasing environment consistency and reducing deployment times.
Introduced a service mesh (Istio) and automated scaling (KEDA, Cluster Autoscaler) to optimize system elasticity.
Enhanced security posture with Prisma Cloud integration and continuous compliance for RBI/SOC2 audits.
Mentored engineers in SRE practices, incident management, and cloud cost engineering.

Terraform, ArgoCD, Kubernetes, Incident Response, Cost Engineering, Site Reliability Engineering, +1 more

Informatica

2 roles

Principal DevOps Engineer

Promoted

Mar 2021 – Apr 2023 · 2 yrs 1 mo

Owned automation, reliability, and infrastructure modernization across Informatica’s cloud-native products.
Scaled CI/CD and cloud automation frameworks spanning AWS and Azure, supporting multiple business units.
Improved deployment frequency by 3x through Kubernetes adoption and GitOps-based delivery pipelines.
Led disaster recovery and performance engineering efforts, achieving sub-5-minute RTO and automated failover.
Created internal SRE knowledge base and led hands-on mentoring for DevOps engineers across teams.

AWS, Azure, Kubernetes, GitOps, DevOps Automation, Site Reliability Engineering

Lead DevOps Engineer

May 2018 – Mar 2021 · 2 yrs 10 mos

Automated AWS provisioning using Terraform and Python (boto3), improving deployment efficiency and reducing manual effort.
Migrated databases (MySQL, MongoDB → PostgreSQL) with minimal downtime and zero data loss.
Managed Kubernetes migrations and Azure VMSS-based zero-downtime deployments.
Designed and executed DR plans, ensuring operational resilience and continuity.
Delivered 15% cost savings through resource optimization and environment standardization.

Terraform, Python, Kubernetes, DevOps Automation, Site Reliability Engineering

Hewlett Packard Enterprise

DevOps Engineer

Dec 2016 – May 2018 · 1 yr 5 mos · Bengaluru Area, India

Designed automated CI/CD pipelines using Jenkins, Chef, and Ansible to improve release reliability.
Deployed and managed scalable, fault-tolerant AWS environments for enterprise information governance products.
Migrated legacy on-prem applications to AWS, introducing cost visibility and resource tracking frameworks.

Jenkins, Chef, Ansible, AWS, DevOps Automation

Cisco

DevOps Engineer

Oct 2015 – Nov 2016 · 1 yr 1 mo · Bengaluru

Managed CI/CD ecosystem using Jenkins, Artifactory, Docker, and Ansible.
Built reusable containerized environments, cutting setup time for demo and test systems by 70%.
Integrated monitoring with New Relic and established consistent deployment governance policies.

Jenkins, Docker, Ansible, DevOps Automation

QASource

Software Engineer

Nov 2011 – Oct 2015 · 3 yrs 11 mos

Automated server configuration using Ansible and Chef, improving reliability of QA and production environments.
Deployed monitoring and alerting via New Relic and PagerDuty to support 24x7 application uptime.
Administered cloud workloads (AWS EC2, RDS, S3, IAM) and web servers (Apache, Nginx, HAProxy).
Supported end-to-end release pipelines across development, QA, and production environments.

Ansible, Chef, AWS, Site Reliability Engineering