Prakhar Mathur

SRE (Site Reliability Engineer)

India · 3 yrs 1 mo experience

AI Enabled · AI ML Practitioner

Key Highlights

Expert in Site Reliability Engineering and Cloud Infrastructure.
Proven track record in enhancing system reliability and performance.
Passionate about scaling Generative AI solutions.

Stackforce AI infers this person is a Site Reliability Engineer specializing in Cloud Infrastructure and AI/ML observability.

Skills

Core Skills

Site Reliability Engineering, Cloud Computing, DevOps

Other Skills

Ad Tech, Datadog, Artificial Intelligence (AI), Proactive Monitoring, Grafana, AI/ML workload observability, Elastic Stack (ELK), Incident Response, Kubernetes, Ticketing Systems, New Relic, Database Systems, System Monitoring, Service-Level Agreements (SLA), Generative AI

About

Hey there! I’m Prakhar, a Site Reliability Engineer (SRE) passionate about building the backbone of modern tech. I thrive on turning chaos into order, automating the ordinary, and ensuring systems are secure, scalable, and faster than your Wi-Fi at a coffee shop. Currently, I’m an SRE - I at IQM Corporation, where I focus on enhancing system reliability and optimizing cloud infrastructure. My work involves everything from defining SLOs and managing incident response to ensuring our platform runs smoothly. While I speak fluent SRE, my eyes are on the future of AI Platform Strategy. I believe that robust infrastructure is the unsung hero of Generative AI, and I am focused on applying DevOps culture and observability principles to scale AI/ML models reliably. Let’s connect if you want to chat about Cloud Infrastructure, AI Platform Strategy, or just geek out over the latest tech trends!

Experience

Total Experience: 3 yrs 1 mo

Average Tenure: 1 yr

Current Experience: 11 mos

IQM Corporation

Site Reliability Engineer - I

Jun 2025 – Present · 11 mos · Remote

Core member of the SRE team, collaborating with DevOps and AdOps units across India and the US to ensure the reliability and scalability of the IQM Demand Side Platform.
> Leveraging Datadog and Grafana to implement advanced monitoring strategies, focusing on anomaly detection and behavioral analysis to identify performance deviations in real time and laying the groundwork for AI/ML workload observability.
> Assisting the Platform team with capacity planning and performance optimization so that resources scale efficiently for data-intensive processing.
> Engineered AI-ready observability pipelines using Datadog and Grafana to distinguish infrastructure noise from application anomalies, enabling real-time latency tracking for high-frequency algorithmic bidding engines.
> Collaborated with cross-functional Data and AdOps teams to define Service Level Objectives (SLOs), ensuring the reliability of the platform and the automated bidding system.

Ad Tech, Datadog, Artificial Intelligence (AI), Proactive Monitoring, Site Reliability Engineering, Cloud Computing

Lakshya Consultancy Inc

Site Reliability Engineer

Jan 2024 – May 2025 · 1 yr 4 mos · Hybrid

I played a pivotal role in enhancing system reliability, optimizing cloud infrastructure, and ensuring robust security standards across all environments. As a Site Reliability Engineer, I was responsible for monitoring CI/CD pipelines, leveraging automation to improve operational efficiency, and implementing secure, scalable solutions to support business growth.
My core responsibilities included:
> CI/CD Pipeline Development and Maintenance: Monitoring CI/CD pipelines, ensuring seamless deployment processes across development, staging, and production environments.
> AWS Cloud Architecture Design: Engineered secure and scalable architectures on AWS, leveraging services such as VPC, EC2, S3, IAM, EKS/Kubernetes, X-Ray, CloudWatch, and WAF to optimize application deployment and enhance system performance.
> Containerization and Orchestration: Built and managed secure Docker containers and orchestrated Kubernetes clusters to support microservices-based architecture, ensuring high availability and resilience of applications.
> Advanced Monitoring and Alerting: Implemented New Relic for comprehensive monitoring, alerting, and visualization of system performance and security metrics, enabling proactive issue resolution and minimizing downtime.
> Incident Management and Response: Managed incident response protocols, participated in on-call rotations, and conducted post-incident reviews to improve system reliability and prevent recurrence. Developed and enforced security policies to safeguard organizational assets.
> Automation and Scripting: Leveraged Linux and shell scripting to automate routine maintenance tasks and security updates, reducing manual effort by 40% and enhancing overall system stability.
> Collaboration and Knowledge Sharing: Worked closely with cross-functional teams to conduct root cause analysis and implement effective solutions, contributing to a culture of continuous improvement and knowledge sharing.

Elastic Stack (ELK), DevOps, Incident Response, Kubernetes, Ticketing Systems, New Relic (+18 more)

Decurtis Corporation

2 roles

Associate Site Reliability Engineer

Aug 2023 – Dec 2023 · 4 mos · Jaipur, Rajasthan, India · On-site

I am dedicated to enhancing the reliability, scalability, and performance of critical systems. I combine a passion for ensuring the availability of mission-critical services with adaptability in high-pressure environments.
> Monitoring and Alerting: I've implemented and managed monitoring and alerting systems to swiftly address issues, ensuring maximum service uptime.
> Automation: Proficient in automation tools and scripting, I've automated routine tasks to improve incident response times and reduce operational overhead.
> Incident Management: I've played a pivotal role in incident management, actively participating in incident response and post-incident analysis to drive continuous improvement.
> Data Governance: I've ensured that data generated and used by the organization's systems is managed with a focus on security, compliance, reliability, and efficiency.

Kibana, Elastic Stack (ELK), DevOps, Production Debug, Incident Response, Kubernetes (+31 more)