Rahul Jain

CTO

Bengaluru, Karnataka, India · 13 yrs 7 mos experience

Key Highlights

Over 12 years of experience in SRE management.
Proven track record in improving system reliability.
Expert in large-scale distributed systems and cloud infrastructure.

Stackforce AI infers this person is a Site Reliability Engineering expert in SaaS with extensive experience in big data technologies.

Skills

Core Skills

Site Reliability Engineering · Hadoop · Cloud Infrastructure Management · AWS · Data Management · Big Data

Other Skills

Linux · Automation · Documentation · Python (Programming Language) · Observability · Security · Monitoring · Data Visualization · Data Analysis · Leadership · Amazon Web Services (AWS) · MapReduce · Hive · Shell Scripting · Tableau

About

Site Reliability Engineering (SRE) Manager with over 12 years of experience in managing and optimising large-scale distributed systems. Proven track record of improving system reliability, performance, and efficiency in dynamic, high-availability environments. Skilled in leading cross-functional teams, implementing SRE best practices, and fostering a culture of continuous improvement. Specialised in: large-scale distributed data systems, containerisation and orchestration (Docker and Kubernetes), monitoring and data visualisation (Prometheus, Grafana, Tableau), deployment pipelines and automation (Jenkins), configuration management (Salt), and tooling/scripting (Python).

Experience

Total Experience: 13 yrs 7 mos

Average Tenure: 3 yrs 3 mos

Current Experience: 5 mos

Apple

Site Reliability Engineering Manager

Nov 2025 – Present · 5 mos

Media.net

Manager Site Reliability Engineering

Mar 2023 – Oct 2025 · 2 yrs 7 mos · Bengaluru, Karnataka, India · Hybrid

I oversee three teams integral to the data platform charter at Media.net, managing multiple Hadoop, HBase, and Kafka clusters, GCP cloud infrastructure, multi-tenant Airflow, Prometheus and Grafana setups, Jenkins for CI/CD, and more. My primary role is to ensure the health, uptime, and optimization of these systems to support critical business pipelines within defined SLAs. I collaborate closely with Engineering, Business, and Product stakeholders to align SRE goals with business objectives.
Major Contributions:
Implemented a culture of documentation and ticketing within the teams.
Championed an automation-first mindset, reducing operational toil by 40%.
Reduced Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) by 30% through improvements in monitoring and alerting frameworks.
Introduced a culture of Root Cause Analysis (RCA) and blameless postmortems, improving communication across teams.
Collaborated with the HR team and organizational leaders to enhance feedback systems, calibration processes, and promotion guidelines.

Hadoop · Linux · Site Reliability Engineering

Swiggy

Staff DevOps Engineer

Mar 2022 – Jan 2023 · 10 mos · Bengaluru, Karnataka, India · Hybrid

Leading an 8-member Production Engineering team responsible for managing Swiggy's cloud infrastructure on AWS. My role focuses on identifying opportunities to enhance infrastructure security, resilience, and efficiency.
Major Contributions:
Developed a solution from the ground up using AWS Config to enhance the security posture.
Advocated for and implemented a transition from long-lived credentials to IAM-based temporary tokens for RDS ad hoc access.
Led an initiative to improve ticket turnaround time (TAT) by approximately 30% through process changes, an automation-first approach, daily stand-ups, and timely ticket triaging.
Promoted and implemented a solution for monitoring issues caused by Prometheus performance bottlenecks.
Initiated a tech talk series and used the recorded sessions to create a self-paced onboarding guide for new team members.
Served as a core member of the hiring and calibration committee.

Python (Programming Language) · Observability · Cloud Infrastructure Management · AWS

LinkedIn

3 roles

Staff Site Reliability Engineer

Promoted

Mar 2021 – Mar 2022 · 1 yr

Led the Bangalore-based Hadoop SRE team, collaborating with Dev and SRE counterparts in both Bangalore and the US.
Major Contributions:
Designed and implemented a solution to calculate the Hadoop cluster availability metric, used in leadership forums to track uptime for each cluster.
Overhauled the monitoring and alerting framework, significantly improving Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Collaborated with partner teams to define SLA/SLO and escalation criteria for all Hadoop services.
Served as a core member of the SRE hiring and promotion committee.
Organized and hosted a Hadoop meet-up event for LinkedIn.

Hadoop · Linux · Site Reliability Engineering

Sr. Site Reliability Engineer

Mar 2017 – Feb 2021 · 3 yrs 11 mos

As part of the Grid SRE team, I worked on building Hadoop clusters with hundreds of petabytes of capacity, including more than 10K nodes in one of our biggest clusters. I dealt with the complexities that come with the increasing size of Hadoop clusters, both in data volume and in number of nodes.
Major Contributions:
Led the initiative to identify and eliminate Single Points of Failure (SPoF) in LinkedIn's Hadoop ecosystem.
Developed a tool (named Gridview) to gain visibility into cluster utilisation patterns and detect any misuse of the cluster resources.
Created automation for managing hardware inventory and cluster expansion.
Implemented automation to identify and rectify version drift for Hadoop and related services like Hive and Spark.
Developed a dashboard to capture all alerts triggered during on-call shifts, used to review on-call load and enhance the monitoring and alerting framework.
Represented LinkedIn at the Bangalore Hadoop Meetup group, presenting Gridview, our Hadoop cluster insights solution, to the wider community.

Hadoop · Linux · Site Reliability Engineering

Site Reliability Engineer

Feb 2015 – Mar 2017 · 2 yrs 1 mo

The Data Services team at LinkedIn oversees the management of the company's data and the operation of its highly scalable and extensive data ingestion pipelines. In my role on the Data Services team, my primary responsibility is to establish and maintain the data ingestion pipelines while ensuring compliance with SLAs.
Major Contributions:
Created a tool to visualize data availability at each stage of the pipeline, highlighting points of failure, SLA breaches, and data arrival times.
Developed a tool for live DAG (Directed Acyclic Graph) representation of critical pipelines, identifying delays, points of failure, and SLA breaches.
Established a self-serve portal for managing the retention and lifecycle of Hadoop datasets.

Hadoop · Linux · Data Management

Tech Mahindra

Hadoop Developer and Administrator

Apr 2012 – Jan 2015 · 2 yrs 9 mos · Hyderabad Area, India · On-site

At Tech Mahindra, the Big Data Center of Excellence team served as the primary contact for clients' big data technology needs. In this role, my primary responsibility involved reviewing and analysing client requirements, then building a Proof of Concept (PoC) environment and developing the solution. The leadership team would then present the PoC results to the client and pitch for larger projects.
Major Contributions:
Implemented multiple Hadoop clusters using Cloudera, Hortonworks, and Apache Hadoop distributions.
Created a system for sentiment analysis of social media posts for a media production house.
Developed a solution for the Daimler group for analysing petabytes of data using Hadoop and Hive.
Designed several Tableau dashboards for client presentations.

Hadoop · Linux · Big Data