Prakhar Agrawal

Data Engineer

Bengaluru, Karnataka, India · 6 yrs 9 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Architected a real-time CDC pipeline processing over 5 TB/day.
  • Reduced Redshift costs from $100K/month to $40-45K/month.
  • Developed automated data quality validation frameworks.
Stackforce AI infers this person is a Fintech Data Engineer specializing in real-time data processing and cost optimization.

Skills

Core Skills

Data Engineering, Real-time Data Processing, Data Management, Data Governance, Business Intelligence, Machine Learning

Other Skills

Apache Flink, PyFlink, Apache Kafka, Debezium, Apache Iceberg, Apache Spark, Amazon EMR, Amazon S3, AWS Glue, AWS SES, Python, SQL, MySQL, Kafka, S3

About

Data Engineer at PayU with 5+ years of experience designing and operating production-grade data platforms in fintech.

What I've built and own:

  • Real-Time CDC Data Lake: Architected an end-to-end streaming pipeline (MySQL → Debezium → Kafka → PyFlink → Apache Iceberg on S3) ingesting 5+ TB/day of CDC events at 10K events/sec, with 2-3 minute end-to-end latency. Designed a two-job architecture (stateless append + stateful ROW_NUMBER dedup over RocksDB) for fault isolation and replayability. Live in production, scaling to 20 tables.
  • Iceberg Health Framework: Built automated 5-step optimization on Spark/EMR: compaction by primary key, equality delete-file resolution, snapshot expiry, and orphan cleanup, with conflict-aware retries and SES-based HTML alerting. Also developed a sync-lag monitor leveraging Iceberg per-file column stats for memory-efficient scans.
  • Business-Centric Data Mart: Led a cross-functional initiative across all PayU offices to design two star-schema marts that standardized previously undocumented business logic. Cut Redshift spend from $100K/month to $40-45K/month and enabled 10x traffic on the same infrastructure. Migrated 20K+ legacy queries via SQLGlot script (90% automated) and built an AI-powered agent for ongoing onboarding.
  • ML Feature Pipelines: Built real-time feature systems for domestic and international fraud models using Spark Structured Streaming, Delta Lake on S3, and Redis for sub-millisecond serving.
  • Data Governance: Deployed OpenMetadata for org-wide discovery and lineage. Built a self-hosted Great Expectations framework during audit for automated data quality validation.

Tech: Apache Flink (PyFlink), Spark Structured Streaming, Kafka, Debezium, Apache Iceberg, Delta Lake, Airflow, AWS (EMR, Redshift, MSK, S3, Glue, SES), Python, SQL, Redis, Cassandra, MySQL.

Education: M.Tech Data Science and Analytics from IIIT Allahabad (Gold Medalist, 9.815 CGPA). Published in Journal of Intelligent and Fuzzy Systems (Feb 2023).

Open to senior data engineering / staff engineer / data platform roles. Reach out: sunshineprakhar@gmail.com.

Experience

Total Experience: 6 yrs 9 mos
Average Tenure: 2 yrs 3 mos
Current Experience: 3 yrs 5 mos

PayU

2 roles

Data Engineer - II

Promoted

Apr 2025 - Present · 1 yr 2 mos · Hybrid

  • Driving the modernization of PayU's data platform — from real-time CDC ingestion to cost-optimized analytics.
  • Key initiatives:
  • Architected a real-time CDC pipeline (MySQL → Debezium → Kafka → PyFlink → Apache Iceberg on S3) processing 5+ TB/day at 10K events/sec per table, with 2-3 minute end-to-end latency. Live in production, scaling to 20 tables by end of year with planned 4x infrastructure expansion.
  • Designed a two-job streaming architecture: stateless staging job (append-only audit trail) + stateful dedup job using ROW_NUMBER over RocksDB with 6-hour TTL and Iceberg merge-on-read upsert. Achieves fault isolation, independent scaling, and schema evolution without reprocessing Kafka.
  • Built a production Iceberg health framework on Spark/EMR with automated 5-step optimization (compaction, delete-file resolution, snapshot expiry, orphan cleanup), conflict-aware retries, and SES-based HTML alerting.
  • Developed a sync-lag monitoring system using progressive time-window scanning over Iceberg per-file stats, tracking Debezium lag, Flink lag, and end-to-end staleness with severity-based alerting.
  • Evaluating Hive Metastore migration to replace AWS Glue Catalog after identifying Athena query costs on Iceberg as unsustainable at scale.
  • Continuing to extend the business-centric data mart to cover the finance revenue model. Built an AI-powered agent that auto-converts legacy queries to the new mart schema and onboards business teams.
  • Stack: Apache Flink, PyFlink, Apache Kafka, Debezium, Apache Iceberg, Apache Spark, Amazon EMR, Amazon S3, AWS Glue, AWS SES, Python, SQL.
Apache Flink, PyFlink, Apache Kafka, Debezium, Apache Iceberg, Apache Spark, +8 more
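
The stateful dedup job described above (ROW_NUMBER per primary key with a state TTL) can be illustrated with a minimal pure-Python sketch. This is not the production PyFlink job: the `ChangeEvent` shape, the `LatestPerKey` class, and the explicit `now_ms` clock are hypothetical names introduced only for illustration, and an in-memory dict stands in for RocksDB state. It shows the keep-latest-per-key idea and why a TTL means very late events can slip through once state has expired.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

SIX_HOURS_MS = 6 * 60 * 60 * 1000  # TTL value taken from the bullet above


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical flattened CDC record; field names are illustrative."""
    pk: int      # primary key of the source row
    ts_ms: int   # source change timestamp in milliseconds
    op: str      # "c" = create, "u" = update, "d" = delete


class LatestPerKey:
    """Keep-latest-per-key dedup, similar in spirit to ROW_NUMBER() = 1
    ordered by timestamp descending, with a per-key state TTL."""

    def __init__(self, ttl_ms: int = SIX_HOURS_MS) -> None:
        self.ttl_ms = ttl_ms
        # pk -> (newest ts_ms seen, clock time of the last state update)
        self._state: Dict[int, Tuple[int, int]] = {}

    def process(self, event: ChangeEvent, now_ms: int) -> Optional[ChangeEvent]:
        """Return the event if it should be upserted downstream, else None."""
        entry = self._state.get(event.pk)
        # Treat state older than the TTL as if it had been evicted.
        if entry is not None and now_ms - entry[1] > self.ttl_ms:
            entry = None
        # Drop events older than the newest one already seen for this key.
        if entry is not None and event.ts_ms < entry[0]:
            return None
        self._state[event.pk] = (event.ts_ms, now_ms)
        return event
```

A short usage example: an out-of-order update is dropped while state is live, but the same event is emitted again if it arrives after the TTL has elapsed.

```python
dedup = LatestPerKey(ttl_ms=1000)
created = ChangeEvent(pk=1, ts_ms=100, op="c")
updated = ChangeEvent(pk=1, ts_ms=200, op="u")
late = ChangeEvent(pk=1, ts_ms=150, op="u")

dedup.process(created, now_ms=0)   # emitted
dedup.process(updated, now_ms=10)  # emitted, newer than stored state
dedup.process(late, now_ms=20)     # dropped, older than stored state
```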

Data Engineer - I

Jan 2023 - Jun 2025 · 2 yrs 5 mos · Hybrid

  • Highlights:
  • Business-Centric Data Mart on Redshift — Identified that multiple business teams were running redundant full-table scans, driving runaway compute cost. Led a cross-functional initiative spanning all PayU offices to design two star-schema data marts with pre-joined flat tables, standardizing previously undocumented business logic. Result: monthly Redshift spend dropped from $100K to $40-45K while enabling 10x traffic on the same infrastructure. Mart serves 100+ users with 10-15 concurrent analysts.
  • Redshift Optimization — Optimized cluster with sort keys, distribution keys, and workload scheduling. Eliminated 8-node concurrency scaling entirely on the primary cluster and consolidated a secondary cluster from 4 to 2 nodes. Drove team-wide adoption of best practices.
  • Query Migration at Scale — Built a SQLGlot-based migration script that auto-rewrote 20,000+ legacy queries onto the new mart with 90% success rate, with the remaining 10% migrated manually. Later developed an AI-powered agent for ongoing query conversion.
  • Real-Time ML Feature Pipelines — Built feature systems for domestic and international fraud detection models. Streamed raw data into Delta Lake on S3, computed 1-minute aggregated features using Spark Structured Streaming, and stored them in Redis with memory-efficient keys for sub-millisecond model serving. Designed batch fallback workflows from Redshift through S3 with Spark batch jobs.
  • Data Governance — Deployed OpenMetadata (2023) as the organization's data catalog, enabling discovery, lineage tracking, and ownership management. Built a self-hosted Great Expectations framework (2024, during audit) for automated data quality validation across NRT and data mart tables.
  • Earlier Work — Built dashboards in Apache Superset, ran a Mage AI POC, and collaborated with DS team on user-activity detection models.
  • Recognized with the PayU ThankU Award and PayU Ace Award for collaboration, ownership, and delivering high-impact results.
Python, SQL, Redshift, Data Governance, OpenMetadata, Great Expectations, +1 more
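
The 1-minute aggregated features mentioned above amount to a tumbling-window aggregation keyed by entity and minute. The sketch below shows that idea in plain Python rather than Spark Structured Streaming. The input tuple layout, the `card_id` entity, the `cnt`/`sum` features, and the compact `entity:minute` key format are all assumptions made for illustration; the profile does not state which features or key scheme were used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_MS = 60_000  # 1-minute tumbling window


def window_index(ts_ms: int) -> int:
    """Index of the 1-minute window that contains ts_ms."""
    return ts_ms // WINDOW_MS


def aggregate_features(
    txns: Iterable[Tuple[str, int, float]]
) -> Dict[str, Dict[str, float]]:
    """Aggregate (card_id, ts_ms, amount) tuples into per-window features.

    Keys use a short hypothetical format "<card_id>:<window index>" so that
    a key-value store holds one small entry per entity per minute.
    """
    features: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"cnt": 0, "sum": 0.0}
    )
    for card_id, ts_ms, amount in txns:
        key = f"{card_id}:{window_index(ts_ms)}"
        features[key]["cnt"] += 1
        features[key]["sum"] += amount
    return dict(features)
```

For example, two transactions at 0 ms and 59,999 ms land in the same window, while one at 60,000 ms starts the next window and gets its own key.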

Axtria - Ingenious Insights

Analyst Intern - Decision Science

May 2022 - Dec 2022 · 7 mos · Remote

  • Worked in the Decision Science R&D team.
  • Created and implemented SVM and ANN classification models with accuracy > 93%.
  • Generated datasets from multiple input folders by applying various transformations and manipulations to produce production-ready data.
Python, SVM, ANN

Indian Institute of Information Technology

2 roles

Data Science & Analytics Placement Coordinator

May 2021 - Dec 2022 · 1 yr 7 mos · Prayagraj, Uttar Pradesh, India

  • Identified companies through various job portals and connected with their hiring teams.
  • Built strong relationships with them to facilitate the placement process for the entire batch and place students with the best packages.
  • Handled all placement-related activities, including screening candidates against the eligibility criteria, maintaining placement data, and collecting feedback from companies.
Microsoft Excel, Business Intelligence

Teaching Assistant

Apr 2021 - May 2022 · 1 yr 1 mo · Prayagraj, Uttar Pradesh, India

  • Worked as a teaching assistant under Prof. U. S. Tiwary for his Knowledge Engineering course. Prepared assignments and tests for evaluation, and conducted vivas.
Microsoft Excel, Bash

Accenture

Application Development Associate

Aug 2019 - Apr 2021 · 1 yr 8 mos · Greater Bengaluru Area

  • ETL implementation. I was responsible for:
  • Extracting data from various source systems such as Oracle.
  • Scripting with Teradata BTEQ through mainframes to load data from flat files into intermediate (temporary) tables.
  • Applying transformations by writing Teradata SQL in mainframe scripts.
  • Loading data into the interim tables (the Stage layer) using Teradata load utilities such as MultiLoad, FastLoad, and BTEQ.
  • Developing Windows batch scripts to complete assigned tasks and create data solutions that trigger jobs.
  • Importing test data to the mainframes.
  • Creating and validating unit test cases and unit test report documents.
  • Verifying the development checklist by running all test cases before moving the scripts to the production environment.
  • Moving the scripts to the handover (production) box.
  • Conducted research, gathered information from multiple sources, and presented results.
  • CMSR Migration: Teradata to Hadoop
  • Was part of the team that designed and developed the HLD for the migration project.
  • Developed Hive queries and migrated data for 100+ flows from Teradata to the Hadoop staging area.
  • Handled importing of data from Teradata, performed transformations using Hive, and loaded the data into the data lake.
HiveQL, Jira

Education

Indian Institute Of Information Technology Allahabad

Master of Technology - MTech — Data Science and Analytics

Jan 2021 - Jan 2023

Madhav Institute of Technology and Science, Gwalior

Bachelor of Engineering — Computer Science and Engineering

Jan 2015 - Jan 2019

Jawahar Navodaya Vidyalaya - JNV

Higher Secondary — Mathematics and Computer Science

Jan 2014 - Jan 2015

Jawahar Navodaya Vidyalaya - JNV

High School — Mathematics

Jan 2012 - Jan 2013
