Sharad Dubey

Engineering Manager

Bengaluru, Karnataka, India · 10 yrs 6 mos experience

AI Enabled · AI ML Practitioner

Key Highlights

Established a modern SRE framework reducing MTTR by 40%
Built AI diagnostic tooling cutting alert investigation time significantly
Created a terraform codebase for automated infrastructure deployments

Stackforce AI infers this person is a Site Reliability Engineer with expertise in cloud-native solutions and automation in SaaS environments.

Contact

sharaddubey70@gmail.com LinkedIn

Skills

Core Skills

Site Reliability Engineering, Cloud Computing, Automation, Monitoring, System Administration

Other Skills

AI, APM, AWS, Ansible, AppDynamics, Architectural Maturity, Azure, Azure DevOps Server, Azure pipeline, Bash, CI/CD, Configuration Management, Continuous Integration and Continuous Delivery (CI/CD), Datadog, Distributed Systems

About

Red Hat and AWS certified Site Reliability Engineer and leader with demonstrated skills in Linux, DevOps, cloud computing, reliability engineering, Python, JavaScript, and system design.

Experience

Total Experience: 10 yrs 6 mos

Average Tenure: 1 yr 11 mos

Current Experience: 7 mos

Booking Holdings (NASDAQ: BKNG)

Engineering Manager (SRE & Platform)

Nov 2025 – Present · 7 mos · On-site

Tessell

SRE Manager

Apr 2023 – Nov 2025 · 2 yrs 7 mos

SRE Manager leading a high-performing team across OS/Cloud, Databases, and SDEs, focused on platform scalability, distributed systems reliability, and architectural maturity for Tessell’s cloud-native DBaaS platform.
Worked closely with Architecture, Product, and Platform Engineering to define capacity models, control-plane design, autoscaling strategies, and reliability patterns, supported by a strong understanding of service topology and data flows.
Established a modern SRE framework using SLIs/SLOs, error budgets, and ITSM alignment, strengthening reliability governance and reducing MTTR by 40%. Introduced a structured SRE delivery model with sprint discipline and engineering quality gates, improving predictability and platform readiness.
Led architecture, design, and code reviews, enhancing resiliency, observability depth, and regression control to build scalable microservices and distributed systems.
Built AI-driven diagnostic tooling that reduced alert investigation time from 30 minutes to under 10, and Python-based automations for repetitive workflows that cut toil by 35%.
Implemented Ansible-based configuration management, streamlining upgrade orchestration across ~1.2k nodes/month, and developed Terraform IaC blueprints for AWS and Azure to standardize deployments and accelerate provisioning. Optimized on-call processes to support 50K+ alerts/quarter, improving signal-to-noise and reducing false positives by 30%.
Drove improvements in alert quality, telemetry pipelines, platform hardening, and security baselines across OS, Cloud, and DB layers to meet enterprise-grade standards. Partnered on capacity planning, resilience engineering, and long-term architectural scaling to support customer growth and platform elasticity.
Emphasized architectural rigor, automation-first engineering, distributed systems reliability, and the development of high-performing teams that deliver resilient, self-healing, and scalable platforms.

SRE, Platform Scalability, Distributed Systems Reliability, Architectural Maturity, SLIs/SLOs, Error Budgets, +8

Morgan Stanley

2 roles

Senior Manager

Jan 2022 – Mar 2023 · 1 yr 2 mos

Building SRE Team: Responsible for building and managing an SRE team to support the settlements platform.
SRE Engagements: Conducting SRE engagements to decide deployment strategies, capacity planning, system designs, and building CI/CD pipelines.
Terraform Codebase Creation: Creating a Terraform codebase for the domain to automate infrastructure and application deployments.
Onboarding APM and Infra Observability: Onboarding APM and infra observability for new projects over tools like Datadog, AppDynamics, PagerDuty, and Azure Monitors.
Software Development: Developing software to assist with operations and conducting post-incident reviews.
Automating Processes: Parachuting into different app teams to enhance automation, reduce manual toil, and fix roadblocks.
Azure Cloud Infrastructure Deployment: Working on infrastructure deployment to Azure cloud using Terraform.
Monitoring and Alerting: Setting up Datadog observability via Terraform, with PagerDuty integration (using API keys) for alerting.
Code Deployment Automation: Automating code deployment using Jenkins with Git integration.
Azure Resource Creation using Terraform: Working on the creation of Azure resources using Terraform.
Reducing MTTR: Reducing Mean Time To Recovery (MTTR) and defining the right Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Log Analysis: Analyzing production server error logs and maintaining documents of reports.
Troubleshooting: Troubleshooting application and system-related issues.

SRE, Terraform, Azure, CI/CD, APM, Observability, +3

Manager (SRE)

Jul 2020 – Dec 2021 · 1 yr 5 mos

Bed Bath & Beyond

Site Reliability Engineer

Aug 2019 – Jul 2020 · 11 mos · Gurgaon, India

Proficient in implementing continuous integration with Jenkins.
Skilled in configuring Jenkins Master and Slaves to create an efficient build environment.
Expertise in creating Maven-based build pipelines in Jenkins, which includes compilation, code reviewing, testing, and deployment of the code.
Well-versed in installing various plugins in Jenkins to enable integration with version control, configuration management, deployment, and monitoring tools.
Successfully built and managed CI/CD pipelines for projects.
Possess extensive knowledge of Git and Git workflows, including common SCM practices such as branching and code merging.
Experience in container orchestration, including Docker container cluster orchestration using Kubernetes/Docker Swarm.
Strong working knowledge of cloud environments and tools, with additional experience in AWS.
Experienced in implementing CI/CD pipelines in Microservices architecture.
Provided configuration, maintenance, and testing setup for developers, and expanded integration coverage for software-defined enterprise infrastructure.
Proficient in building Maven-based build environments, and automating continuous builds and deployments in hybrid cloud environments.
Maintained Git repositories for developers, and built, administered, and troubleshot mission-critical environments such as Production, Stage, Dev, Test, and QA.
Participated in an on-call schedule in the local time zone to address any issues.
Proficient in server monitoring systems such as Nagios.

Jenkins, Maven, Docker, Kubernetes, AWS, Site Reliability Engineering, +1

Adobe

Site Reliability Engineer

Sep 2018 – Aug 2019 · 11 mos

Worked with a variety of Amazon Web Services (AWS) offerings, such as Elastic Compute Cloud (EC2), Elastic Load Balancer (ELB), Virtual Private Cloud (VPC), Simple Storage Service (S3), CloudFront, Identity and Access Management (IAM), Relational Database Service (RDS), Route 53, and CloudWatch.
Set up and managed VPCs, subnets, and established connections between different zones using VPC peering. Also, blocked suspicious IP/subnets via ACL.
Created and managed Amazon Machine Images (AMIs), snapshots, and volumes. Additionally, created and managed autoscaling configurations.
Configured Route 53, RDS, VPC subnets, route tables, ACLs, and security groups. Also, configured and managed IAM and roles.
Created and managed S3 buckets to store database dumps and log backups. Set bucket policies as well.
Utilized CloudWatch to monitor EC2, ELB, and Elastic File System (EFS). Configured Route 53 with different routing options.
Configured EFS across multiple instances for storage and monitored servers through CloudWatch.
Created playbooks and roles in Ansible to provision/decommission AWS instances across the infrastructure with customized configurations. Also, created playbooks and roles to automate manual tasks and various deployments using Python scripts in Ansible.
Installed and configured host groups, hosts, scripts, and checks in Nagios for monitoring. Monitored production server health on different parameters such as CPU load, physical memory, swap memory, hard disk, services, HTTP service, and response time via Nagios.
Ran queries on Splunk for log analysis, maintained documents of production server error logs, and installed and set up Apache web servers. Troubleshot issues on web servers to maintain site availability.
Installed SSL certificates on Apache and AWS ELBs, managed DNS records in BIND and Route 53, and created scripts for daily tasks such as performance monitoring, backups, and server logs. Also, resolved boot issues.

AWS, Ansible, Nagios, Splunk, Site Reliability Engineering, Cloud Computing

HCL Technologies

Unix Engineer

Aug 2015 – Jul 2018 · 2 yrs 11 mos · Noida Area, India

Performance Monitoring and Kernel Upgrades: This includes monitoring system performance and applying kernel upgrades or patches when necessary.
Package Installation, Management, and Verification: This involves the use of both rpm and yum to install and manage packages, as well as to verify their integrity using a local repository.
Automation of Jobs with the cron Scheduler: Automating jobs with the cron scheduler, which runs scripts or other tasks at specific times or intervals.
User Account Management: This includes creating and customizing user and group accounts.
System Monitoring: This involves monitoring disk status, file systems, system and user processes, memory activity, and network activity.
Memory and Swap Space Management: This includes managing memory and swap space to optimize system performance.
ACL Implementation: Experience with implementing ACLs (Access Control Lists) on both user and group levels.
File and Folder Permissions: This includes setting up file and folder permissions and ownerships, as well as sticky bits, setuid, and setgid.
System Recovery: Ability to perform system recovery during system failures.
Disk Administration: This includes creating partitions, file systems, labeling, and monitoring using fdisk.
LVM Configuration: Configuring and administering LVM (Logical Volume Manager).
File System Creation and Extension: Ability to create and extend file systems as required.
tcpdump Captures: Ability to capture network traffic with tcpdump as per requirements.
File Server Configuration: Configuration of file servers using NFS and troubleshooting any issues that arise.
Ability to configure Linux servers using iLO and perform OS installations.
DNS and FTP Configuration: Experience in configuring DNS, FTP, and managing file systems.
Network Configuration and Troubleshooting: Ability to configure networks and troubleshoot any issues that arise.
Experience with using tar, scp, and rsync tools for backup and restoration purposes.