Akashdeep Thakur

DevOps Engineer

Bengaluru, Karnataka, India · 14 yrs 3 mos experience
Highly Stable

Key Highlights

  • Led SRE initiatives for multi-region infrastructure.
  • Achieved 30% cost savings through optimization.
  • Built high-performing teams focused on reliability.
Stackforce AI infers this person is a DevOps and Site Reliability Engineering expert in SaaS.

Skills

Core Skills

Site Reliability Engineering · Cloud Architecture · Automation · Incident Response · Monitoring · Cost Governance · CI/CD Automation · Compliance · Infrastructure as Code · DevOps Automation

Other Skills

API Development · AWS · AWS EKS · ActiveMQ · Agile Methodologies · Amazon Web Services (AWS) · Ansible · ArgoCD · Artifactory · Azure · Bash · Capacity Planning · Chef · Continuous Integration · Continuous Deployment

About

I’m a DevOps and Site Reliability Engineering leader passionate about building platforms that combine resilience, scalability, and simplicity. At Adobe, I lead initiatives that modernize large-scale, multi-region infrastructure, from EKS-based microservices and Kafka streaming to observability, automation, and cost optimization, helping teams ship faster and operate smarter.

Over the years, I’ve built and mentored high-performing teams that embrace ownership and engineering excellence. My focus is on designing systems that self-heal, self-measure, and continuously improve, ensuring reliability becomes a shared mindset across development, QA, and operations.

Core strengths: Cloud architecture (AWS EKS, EC2, RDS, MSK, ElastiCache), Kubernetes & Helm, Terraform & Terragrunt, CI/CD Automation, Monitoring & Observability (Prometheus, Grafana, New Relic, Splunk), Cost Governance.

Leadership focus: Mentorship, Team Enablement, Process Automation, Cross-Functional Collaboration, Continuous Improvement.

I believe great infrastructure quietly enables great products, and great teams turn reliability into culture, not just metrics.

Experience

Adobe

Software Development Engineer

Sep 2024 – Present · 1 yr 6 mos · Noida, Uttar Pradesh, India · Hybrid

  • Leading reliability, observability, and automation strategy for Adobe Connect across three AWS regions (NA, EMEA, APAC), ensuring high availability and predictable scaling for 200+ services.
  • Defined SRE standards and reliability KPIs to align engineering and operations around shared SLIs/SLOs.
  • Designed and rolled out a unified GitOps and Helm-based deployment model, improving release consistency by 40%.
  • Built and mentored a DevOps team focused on Kubernetes, observability, and CI/CD automation.
  • Architected a Prometheus–Thanos–Grafana stack and migrated legacy Nagios monitoring to metrics-based alerting.
  • Delivered AWS cost savings of 30% through workload right-sizing and optimization.
  • Partnered with engineering leadership to strengthen on-call readiness and post-incident review culture.
  Key Projects & Tools (2023–2025):
  • AI-Driven Log Analysis (Ollama + LLMs): Built an internal AI-based log analysis engine using Ollama to summarize large-scale service logs, detect anomalies, and provide incident RCA insights via Slack.
  • Incident Management Bot: Developed a Slack-based assistant integrating Prometheus and New Relic alerts with Jira automation for faster triage.
  • Prometheus Exporter for Analytics Platform: Designed Python-based custom exporters for Connect Analytics metrics, enabling unified observability across EKS clusters.
  • AWS Cost Optimization Toolkit: Automated underutilized resource detection (EC2, EBS, ElastiCache), achieving ~30% infrastructure cost savings.
  • GitOps Bootstrap Automation: Implemented ArgoCD “app-of-apps” pattern with layered Helm values and Secrets Manager integration for multi-region deployments.
  • Global Observability Platform (Prometheus + Thanos): Architected federated metrics ingestion from regional Prometheus instances into Thanos for global analytics.
  • SOC2 Compliance Auditor: Built a policy drift and configuration scanner validating IAM, S3, and VPC rules, automating SOC2 audit readiness.
AWS EKS · Kubernetes · Helm · Terraform · CI/CD Automation · Monitoring · +4

Paytm

Principal Engineer

Apr 2023 – Sep 2024 · 1 yr 5 mos · Hybrid

  • Directed the DevOps and reliability function for Paytm’s fintech infrastructure, aligning engineering execution with regulatory, performance, and availability goals.
  • Led a 5-member SRE team responsible for infrastructure scaling, Kubernetes upgrades, and multi-region resiliency.
  • Improved uptime and reduced incident frequency by 25% through structured release governance and observability adoption.
  • Implemented Infrastructure as Code (Terraform, ArgoCD), increasing environment consistency and reducing deployment times.
  • Introduced a service mesh (Istio) and automated scaling (KEDA, Cluster Autoscaler) to optimize system elasticity.
  • Enhanced security posture with Prisma Cloud integration and continuous compliance for RBI/SOC2 audits.
  • Mentored engineers in SRE practices, incident management, and cloud cost engineering.
Terraform · ArgoCD · Kubernetes · Incident Response · Cost Engineering · Site Reliability Engineering · +1

Informatica

2 roles

Principal DevOps Engineer

Promoted

Mar 2021 – Apr 2023 · 2 yrs 1 mo

  • Owned automation, reliability, and infrastructure modernization across Informatica’s cloud-native products.
  • Scaled CI/CD and cloud automation frameworks spanning AWS and Azure, supporting multiple business units.
  • Improved deployment frequency by 3x through Kubernetes adoption and GitOps-based delivery pipelines.
  • Led disaster recovery and performance engineering efforts, achieving sub-5-minute RTO and automated failover.
  • Created an internal SRE knowledge base and led hands-on mentoring for DevOps engineers across teams.
AWS · Azure · Kubernetes · GitOps · DevOps Automation · Site Reliability Engineering

Lead DevOps Engineer

May 2018 – Mar 2021 · 2 yrs 10 mos

  • Automated AWS provisioning using Terraform and Python (boto3), improving deployment efficiency and reducing manual effort.
  • Migrated databases (MySQL, MongoDB → PostgreSQL) with minimal downtime and zero data loss.
  • Managed Kubernetes migrations and Azure VMSS-based zero-downtime deployments.
  • Designed and executed DR plans, ensuring operational resilience and continuity.
  • Delivered 15% cost savings through resource optimization and environment standardization.
Terraform · Python · Kubernetes · DevOps Automation · Site Reliability Engineering

Hewlett Packard Enterprise

DevOps Engineer

Dec 2016 – May 2018 · 1 yr 5 mos · Bengaluru Area, India

  • Designed automated CI/CD pipelines using Jenkins, Chef, and Ansible to improve release reliability.
  • Deployed and managed scalable, fault-tolerant AWS environments for enterprise information governance products.
  • Migrated legacy on-prem applications to AWS, introducing cost visibility and resource tracking frameworks.
Jenkins · Chef · Ansible · AWS · DevOps Automation

Cisco

DevOps Engineer

Oct 2015 – Nov 2016 · 1 yr 1 mo · Bengaluru

  • Managed CI/CD ecosystem using Jenkins, Artifactory, Docker, and Ansible.
  • Built reusable containerized environments, cutting setup time for demo and test systems by 70%.
  • Integrated monitoring with New Relic and established consistent deployment governance policies.
Jenkins · Docker · Ansible · DevOps Automation

QASource

Software Engineer

Nov 2011 – Oct 2015 · 3 yrs 11 mos

  • Automated server configuration using Ansible and Chef, improving reliability of QA and production environments.
  • Deployed monitoring and alerting via New Relic and PagerDuty to support 24x7 application uptime.
  • Administered cloud workloads (AWS EC2, RDS, S3, IAM) and web servers (Apache, Nginx, HAProxy).
  • Supported end-to-end release pipelines across development, QA, and production environments.
Ansible · Chef · AWS · Site Reliability Engineering

Education

Himachal Pradesh University

Bachelor's degree — Computer Science

Jan 2007 – Jan 2011

DAV School Hamirpur, Himachal

High School/Secondary Diplomas and Certificates

Jan 2005 – Jan 2007
