Abhishek Kumar

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 8 yrs 6 mos experience

Most Likely To Switch · Highly Stable

Key Highlights

Reduced alert volume by 80% through innovative log-based alerting.
Saved $50k in a month via strategic cost optimization.
Developed Infrastructure as Code tool for AWS resource management.

Stackforce AI infers this person is a SaaS-focused Site Reliability Engineer with strong DevOps and Cloud Development expertise.

Skills

Core Skills

Site Reliability Engineering · Cloud Development · DevOps Engineering · Software Engineering

Other Skills

AWS · AWS CloudWatch · AWS EC2 · AWS Identity and Access Management (AWS IAM) · AWS Lambda · AWS S3 · AWS-CLI · Algorithms · Amazon Web Services (AWS) · Ansible · Availability · C · C++ · CI/CD · Computer Science

About

Please Check: https://abhisoniks.github.io/

Experience

Rubrik, Inc.

Site Reliability Engineer

Dec 2020 – Present · 5 yrs 3 mos · Bengaluru, Karnataka, India

Log-based Alerting using NLP and error deduplication: Developed a log-based alerting solution for situations where the error rate is very high, errors are repetitive, and not all errors are actionable. The tool deduplicates logs using NLP, extracts the unique errors, and creates alerts or reports driven by configurable YAML files. The framework reduced alert volume by 80%, and a patent was filed on the idea.
Ensuring Availability and Reliability: Ensure three nines (99.9%) availability for site uptime, API uptime, login, and key UI workflows. Worked with multiple teams to achieve this by introducing fault tolerance, defining SLAs, and building infrastructure to measure availability.
Incident Management: Act as incident manager during major outages. Responsibilities include updating the status page, mitigating the issue, keeping stakeholders informed about the outage, and driving a blameless CLA after the outage.
Operational Excellence: Ensure operational excellence and consistency across all components by defining benchmarks (bronze, silver, and gold certifications) based on each team's operational efficiency. The effort involves periodic audits of application teams covering pager duty hygiene, release hygiene, production readiness, CFDs and incident response, CLA closures, etc.
Observability: Worked with multiple teams to bring industry-standard observability to all aspects of the system, so that problems are identified and fixed before they become outages. Created key observability dashboards such as Top N critical metrics, SLA, the four golden signals, memory, and CPU.
Cost Optimization: Worked on various cost optimization projects on AWS and GCP.
Toil Reduction: Reduced toil by automating repetitive tasks related to development, infrastructure maintenance, incident management, and on-call.
Runbook Automation Framework: Developed a framework to automate the runbooks.
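
The log deduplication idea from the first item above can be illustrated with a minimal sketch. This is not the patented tool: it swaps the NLP step for simple regex masking, and the masking rules, function names, and threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative masking rules (assumptions), applied in order.
# Each one replaces a variable part of a message with a placeholder.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)*\b"), "<NUM>"),
]


def to_template(line):
    """Lowercase, collapse whitespace, and mask variable tokens."""
    text = " ".join(line.lower().split())
    for pattern, placeholder in _MASKS:
        text = pattern.sub(placeholder, text)
    return text


def dedupe(lines):
    """Count how many raw log lines map to each masked template."""
    return Counter(to_template(line) for line in lines)


def pick_alerts(counts, min_count):
    """Return (template, count) pairs seen at least min_count times."""
    return [(t, n) for t, n in counts.most_common() if n >= min_count]
```

Calling `dedupe` on a batch of raw lines and passing the result to `pick_alerts` yields one entry per distinct error shape instead of one per log line. In a configurable setup, `min_count` would come from a YAML file rather than a function argument.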

Log-based Alerting · NLP · Error Deduplication · Availability · Reliability · Incident Management · +7

Qubole

DevOps Engineer

Sep 2019 – Dec 2020 · 1 yr 3 mos · Greater Bengaluru Area

Jenkins Infrastructure Revamp: Migrated Build & Bake automation to a Kubernetes cluster.
Set up the Jenkins infrastructure in an automated, version-controlled fashion using Ansible playbooks, Groovy scripts, and Terraform configurations.
Managed around 30 AWS accounts, including centralized account creation using Control Tower, centralized billing using AWS Organizations, team-wise IAM role-based access provisioning using single sign-on, and centralized governance using Service Control Policies.
Helped the organization with cost-optimization initiatives during the COVID-19 pandemic, saving around $50k in less than a month. Efforts involved weekend shutdowns of internal environments, upgrades of EC2 machines, and automated termination of idle resources.
Set up an automated, version-controlled, and peer-reviewed process for AWS resource creation using Terraform and the third-party tool Atlantis.
Set up full CI/CD pipelines for all supported clouds so that each commit goes through the standard software lifecycle and is tested thoroughly before it reaches production.
Worked closely with release managers and QA engineers on all major releases for all supported clouds. Responsibilities included providing stable infrastructure, delivering the first draft of artifacts, helping with bottlenecks, and listing improvements to address before the next major release.
Built a Python-based tool to automate the PR merge process, enforcing repo-specific mandatory approvals and a valid PR title format.
Built a tool that reports the commit difference between production and hotfix tags during hotfixes on Azure, GCP, and Oracle clouds.
Built a Python- and Jenkins-based tool to cherry-pick commits and push them to release branches.
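
The PR merge gate mentioned above (mandatory approvals per repo plus a title format check) might look roughly like the following sketch. The title convention, the repo names, and the approval counts are hypothetical, since the profile does not state them.

```python
import re

# Hypothetical title convention: "<TICKET-ID>: <summary>",
# for example "OPS-123: Fix flaky build".
TITLE_PATTERN = re.compile(r"^[A-Z][A-Z0-9]+-\d+: \S.*$")

# Hypothetical per-repo approval requirements.
REQUIRED_APPROVALS = {"infra": 2, "docs": 1}
DEFAULT_APPROVALS = 1


def can_merge(repo, title, approvals):
    """Return (ok, reasons); reasons lists every failed check."""
    reasons = []
    if not TITLE_PATTERN.match(title):
        reasons.append("title does not match '<TICKET-ID>: <summary>'")
    needed = REQUIRED_APPROVALS.get(repo, DEFAULT_APPROVALS)
    if approvals < needed:
        reasons.append(f"needs {needed} approvals, has {approvals}")
    return (not reasons, reasons)
```

Returning every failed reason, rather than stopping at the first, lets an automated comment on the PR tell the author everything that needs fixing in one pass.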

Jenkins · Kubernetes · Ansible · Terraform · AWS · Cost Optimization · +4

Dataxu

Software Engineer

Jul 2017 – Jul 2019 · 2 yrs · Bengaluru, Karnataka, India

Major Projects:
1) An IaC tool that manages the company's critical AWS resources in a versioned, automated way by codifying them into configuration files. Efforts involved design, planning, tool development, automation, and leading the effort.
Tech Used: Golang, IAM, Security Groups, KMS, S3, Jenkins
2) Streamlined access to AWS resources using single sign-on. Efforts involved configuring IAM roles with fine-grained access to AWS services and providing a self-service break-glass procedure to assume elevated permissions in an emergency.
Tech Used: IAM Roles, Okta, AWS-CLI, DynamoDB, SSN, SSM
3) Provisioning of spot fleets over on-demand instances for cost optimization.
Tech Used: Spot instances, Spot fleets, ASG
All projects were developed in collaboration with other squad members, mostly operating from the USA and Uruguay.