Gautam Somani

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 15 yrs experience
AI Enabled · AI ML Practitioner

Key Highlights

  • Over 8 years of experience in Site Reliability Engineering.
  • Expert in automating CI/CD processes and infrastructure management.
  • Proven track record in maintaining high uptime for critical services.
Stackforce AI infers this person is a Site Reliability Engineer with expertise in Fintech and E-commerce infrastructure.

Skills

Core Skills

Site Reliability Engineering, Automation, Technical Support, Cloud Infrastructure, Distributed Systems, Operations Engineering, Database Management, Infrastructure Management, DevOps, Linux Administration

Other Skills

Terraform, Python (Programming Language), GitLab, GitHub, Docker, Ansible, AWS, Kubernetes, Helm, Hadoop Administration, Kafka, MySQL, Redis, ZFS, DNS Administration

About

8+ years of experience as an SRE, 20+ years as a Linux Admin. Defining SLAs, SLOs and error budgets, and building and refining processes to achieve them. Responsible for critical services: their uptime, reliability, deployment and troubleshooting. Have extensive experience working in large-scale infrastructure and with large, complex distributed systems, ensuring uptime, reliability and troubleshooting of high-throughput systems. Have good automation skills (C, Python and Bash). Automated deployment, infra provisioning, processes, alerting, error recovery and more. Learning hard-core programming so that I can become an excellent toolsmith and help my org and team solve problems and automate things.

Experience

PhonePe

Lead Site Reliability Engineer

Dec 2022 - Present · 3 yrs 3 mos · On-site

  • In charge of the entire CI/CD infra at PhonePe. Working on GitLab, GitHub and Docker based build infrastructure. Making this part of the infra HA via tooling and automation.
  • Worked in the Azure-Ops team; automated deployments using HashiCorp Packer and Terraform for a subset of the Azure infrastructure.
  • Helping colleagues patch systems as per audit reports from third-party auditors (RBI, etc.).
  • Also doing a little Bash and Lua scripting, and helping my team members make better decisions.
Terraform, Python (Programming Language), Site Reliability Engineering, Automation

Yugabyte

Senior Technical Support Engineer

Jun 2021 - Sep 2022 · 1 yr 3 mos · India

  • Primary role was solving customers' technical issues (their infra ran on a variety of platforms: public/private cloud, VM/container based), following up with internal and external engineering teams on these issues and feature requirements, and helping customers keep their services running with high uptime.
  • Worked on AWS and K8s, though not extensively; know the basics well. Also know Helm charts.
  • Also responsible for writing internal documents and knowledge base articles to improve the product documentation and offer customers a rich support experience.
  • Helped build and refine internal processes around customer support and escalation management, issue triaging and issue closure.
  • Mentored a team of 3 junior members, helping them onboard, learn the nuances of the product and understand the theory behind the distributed database so that they could troubleshoot issues independently.
  • Wrote various Bash/Python scripts to ease my job and to detect and troubleshoot issues faster and more accurately.
Ansible, Docker, Technical Support, Cloud Infrastructure

Flipkart

3 roles

Operations (Platform) Engineer (PAAS)

Promoted

Jul 2017 - May 2021 · 3 yrs 10 mos

  • Founding member of the SRE team at Flipkart. Laid the roadmap, designed processes and implemented SRE culture in the team.
  • Worked as an SRE in Flipkart's Central PAAS team, maintaining high-throughput and highly scaled distributed systems. Responsible for service uptime, and for defining SLOs and SLAs and meeting them.
  • Have knowledge of Kafka, MySQL, Redis, ZK, Java applications, Docker, K8s, Python, Bash, Ansible and Git.
  • Dockerised all stateless Java applications, using K8s and Helm charts.
  • Worked in Flipkart's Central Log and Metric Aggregation and Query Service team from 2017 to 2019.
  • Responsible for defining SLIs, SLOs and SLAs and defining processes to achieve them.
  • Responsible for writing and verifying meticulous runbooks.
  • Responsible for reprovisioning infrastructure to keep the lights on.
  • Responsible for troubleshooting application-level and customer issues.
  • Responsible for deployment of new code in production.
  • Responsible for scaling up systems for high-load events such as the Flipkart Big Billion Day Sale.
Python (Programming Language), Hadoop Administration, Site Reliability Engineering, Distributed Systems

Operations Engineer (EKart)

Promoted

Jan 2016 - Jul 2017 · 1 yr 6 mos

  • Worked as an Operations Engineer in the eKart team of Flipkart. eKart is the logistics, warehousing and last-mile delivery arm of Flipkart.
  • Responsible for maintenance, provisioning and uptime of 2000+ VM and bare-metal instances and the services deployed on them.
  • Offered MySQL-as-a-Service: was the single point of contact for standardised deployment, maintenance and uptime of 80+ production and operation-critical MySQL clusters. Performed very complex MySQL operations to keep things running. Achieved 99.97% uptime against a target of 99.95% over a period of 1.5 years.
  • Managed dozens of multi-terabyte MySQL clusters fronting multiple Tier-0 services that supported tens of thousands of QPS.
  • Audited the entire infra regularly via automation to ensure tech hygiene and standardisation.
  • Responsible for predicting, calculating and procuring capacity requirements in advance.
Python (Programming Language), Redis, Operations Engineering, Database Management

Operations Engineer

Aug 2014 - Dec 2015 · 1 yr 4 mos

  • Scaled up the FAI and Puppet infrastructure to grow DC infra from 2k bare-metal servers to 5.2k servers.
  • Was part of a 25-strong DevOps team supporting and maintaining the entire Flipkart infra.
  • Conducted a basic audit of MySQL databases across Flipkart.
  • Troubleshot issues that devs faced throughout Flipkart. Helped unblock them on capacity and provided guidance on sizing capacity requirements for upcoming projects.
  • Was the MySQL expert for the whole of Flipkart. Ensured databases were backed up daily using automation, and fixed any replication issues that occurred.
ZFS, DNS Administration, Infrastructure Management, Database Management

Data Infosys

Linux Administrator / DevOps Engineer

Sep 2010 - Jul 2014 · 3 yrs 10 mos · Jaipur Area, India

  • Introduced the DevOps way of working in the Linux Admin team to support dev teams in the development of company products and services.
  • Worked closely with developers in implementing RFCs related to the IMAP, SMTP and POP protocols.
  • Maintained 80+ Linux servers hosting email solutions for a variety of customers, ranging from NGOs to government entities and private companies. Maintained DNS, LDAP servers and Postgres database servers (with support from the DBA team).
  • Implemented automation in the team via Bash scripting to speed up execution of daily tasks and reduce the probability of errors. Implemented monitoring of app behaviour via Bash scripts to raise more detailed alerts, enabling faster troubleshooting of problems.
  • Built a team of 10+ Linux Administrators to ramp up support for the Infra, Application Support and Development teams.
  • Delivered POCs for Puppet use cases to help implement idempotency wherever possible.
  • Wrote exhaustive runbooks for developers and Linux Administrators to help them troubleshoot problems faster and in a systematic way.
DevOps, Linux System Administration, Linux Administration
