Abhishek Kumar

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 8 yrs 6 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Reduced alert volume by 80% through innovative log-based alerting.
  • Saved $50k in a month via strategic cost optimization.
  • Developed Infrastructure as Code tool for AWS resource management.
Stackforce AI infers this person is a SaaS-focused Site Reliability Engineer with strong DevOps and Cloud Development expertise.

Skills

Core Skills

Site Reliability Engineering · Cloud Development · DevOps Engineering · Software Engineering

Other Skills

AWS · AWS CloudWatch · AWS EC2 · AWS Identity and Access Management (AWS IAM) · AWS Lambda · AWS S3 · AWS-CLI · Algorithms · Amazon Web Services (AWS) · Ansible · Availability · C · C++ · CI/CD · Computer Science

About

Please Check: https://abhisoniks.github.io/

Experience

Rubrik, Inc.

Site Reliability Engineer

Dec 2020 - Present · 5 yrs 3 mos · Bengaluru, Karnataka, India

  • Log-based Alerting using NLP and error deduplication: Developed a log-based alerting solution for situations where the error rate is very high, the errors are repetitive, and not all of them are actionable. The tool deduplicates logs using NLP, extracts the unique errors, and creates alerts or reports based on configurable YAML files. The framework reduced alert volume by 80%. A patent was filed on this idea.
  • Ensuring Availability and Reliability: Ensure three nines (99.9%) availability for site uptime, API uptime, login, and key UI workflows. Worked with multiple teams to achieve this by introducing fault tolerance, defining SLAs, and building the infrastructure to measure availability.
  • Incident Management: Act as incident manager during major outages. Responsibilities include updating the status page, mitigating the issue, keeping stakeholders informed about the outage, and driving a blameless CLA afterwards.
  • Operational Excellence: Ensure operational excellence and consistency across all components by defining benchmarks (bronze, silver, and gold certifications) based on each team's operational efficiency. The effort involves periodic audits of pager duty hygiene, release hygiene, production readiness, CFDs and incident response, CLA closures, etc. for application teams.
  • Observability: Worked with multiple teams to bring industry-standard observability to all aspects of the system, so that problems are identified and fixed before they become outages. Created key observability dashboards such as Top N critical metrics, SLA, the four golden signals, memory, and CPU.
  • Cost Optimization: Worked on various cost optimization projects on AWS and GCP.
  • Toil Reduction: Reduced toil by automating repetitive tasks related to development, infrastructure maintenance, incident management, and on-call.
  • Runbook Automation Framework: Developed a framework to automate the runbooks.
Log-based Alerting · NLP · Error Deduplication · Availability · Reliability · Incident Management · +7 more

Qubole

DevOps Engineer

Sep 2019 - Dec 2020 · 1 yr 3 mos · Greater Bengaluru Area

  • Jenkins Infrastructure Revamp: Migrated Build & Bake automation to a Kubernetes cluster.
  • Set up Jenkins infrastructure in an automated, version-controlled fashion using Ansible playbooks, Groovy scripts, and Terraform configurations.
  • Managed around 30 AWS accounts, including centralized account creation using Control Tower, centralized billing using AWS Organizations, team-wise IAM role-based access provisioning using Single Sign-On, and centralized governance using Service Control Policies.
  • Helped the organization with cost optimization initiatives during the COVID-19 pandemic, saving around $50k in less than a month. Efforts involved weekend shutdown of internal environments, upgrades of EC2 machines, and automated termination of idle resources.
  • Provisioned an automated, version-controlled, and peer-reviewed process for AWS resource creation using Terraform and the third-party tool Atlantis.
  • Set up full CI/CD pipelines for all supported clouds so that each commit goes through the standard software lifecycle and is tested thoroughly before it reaches production.
  • Worked closely with release managers and QA engineers on all major releases for all supported clouds. Responsibilities included providing stable infrastructure, delivering the first draft of artifacts, helping with any bottlenecks, and listing possible improvements to address before the next major release.
  • Built a Python-based tool to automate the PR merge process, enforcing repo-based mandatory approval and a valid PR title format.
  • Built a tool to show the commit difference between production and hotfix tags during hotfixes on Azure, GCP, and Oracle clouds.
  • Built a Python and Jenkins based tool to cherry-pick commits and push them to release branches.
Jenkins · Kubernetes · Ansible · Terraform · AWS · Cost Optimization · +4 more

Dataxu

Software Engineer

Jul 2017 - Jul 2019 · 2 yrs · Bengaluru, Karnataka, India

  • Major Projects:
  • 1) An IaC tool that manages the company's critical AWS resources in a versioned and automated environment by codifying the AWS resources into configuration files. Efforts involved tool development, design, planning, automation, and leading the project.
  • Tech Used: Golang, IAM, Security Group, KMS, S3, Jenkins
  • 2) Provisioning streamlined access to AWS resources leveraging Single Sign-On. Efforts involved configuring IAM roles with fine-grained access to AWS services and provisioning a self-service break-glass procedure to assume elevated permissions in an emergency.
  • Tech Used: IAM Roles, Okta, AWS-CLI, DynamoDB, SSN, SSM
  • 3) Provisioning of spot fleets over on-demand instances for cost optimization.
  • Tech Used: Spot instances, Spot fleets, ASG
  • All projects were developed in collaboration with other squad members, mostly operating from the USA and Uruguay.
Infrastructure as Code · AWS · IAM · Security Group · KMS · S3 · +4 more

Education

Indian Institute of Technology, Bombay

Master’s Degree — Computer Engineering

Jan 2015 - Jan 2017
