Surendra Kumar

Consultant

Noida, Uttar Pradesh, India7 yrs 11 mos experience

Key Highlights

Expert in real-time data processing and data engineering.
Proven track record in optimizing data pipelines.
Strong experience with Azure and Kafka technologies.

Stackforce AI infers this person is a Data Engineering expert in SaaS environments with a focus on real-time data processing.

Contact

Skills

Core Skills

Data EngineeringReal-time Data ProcessingData ArchitectureData ProcessingEtl DevelopmentMachine Learning Automation

Other Skills

AirflowApache AirflowApache KafkaApache Spark StreamingAzure Cosmos DBAzure Data FactoryAzure Data LakeAzure DatabricksAzure DevOpsBashC (Programming Language)COBOLCOBOL IIDatabricksDelta Lake

Experience

7 yrs 11 mos

Total Experience

2 yrs 3 mos

Average Tenure

1 yr

Current Experience

Celebal technologies

Senior Consultant

May 2025 – Present · 1 yr · Noida, Uttar Pradesh, India · Remote

Tiger analytics

2 roles

Senior Data Engineer

Jan 2025 – May 2025 · 4 mos · Chennai, Tamil Nadu, India · Remote

Client Name: PepsiCo
Project Nature: Real-Time Data Processing for Intelligent Warehouse System
1. Designed and implemented a dynamic, metadata-driven ingestion pipeline to process over 20 distinct data topics across 38 sites, ingesting data from streaming platforms like Kafka and Azure Event Hub into Delta tables.
2. Developed and deployed end-to-end Delta streaming pipelines for over 100 KPIs across 38 sites, managing both batch and real-time (continuous) processing, and writing data into NoSQL Cosmos DB.
3. Engineered complex business logic transformations for real-time streaming data, ensuring high accuracy and operational efficiency.
4. Established error and exception handling frameworks for both streaming and batch pipelines, ensuring reliability and fault tolerance.
5. Implemented log analytics to capture and store runtime logs for efficient debugging and long-term analysis of streaming pipelines.
6. Optimized data pipelines, reducing latency to 2-3 minutes and lowering operational costs.
7. Developed an automated job restart pipeline using Databricks API to restart all streaming pipelines every 24 hours, maintaining cluster health and efficiency.
8. Implemented email notifications to alert stakeholders if data from any Kafka topic or site is missing for over two hours.
9. Created pre-data-generation scripts to ensure default values for 1-minute KPIs, maintaining data consistency.
10. Developed a pipeline for weekly optimization, vacuuming Delta tables, and archiving partitions older than 7 days to maintain system performance.
11. Scheduled all streaming pipelines in Databricks workflows using job clusters.
12. Implemented Change Data Capture (CDC) in Delta streaming tables.
13. Enabled task sharing in Databricks workflows by using job task value properties.
14. Optimized Cosmos DB performance by increasing RUs and by writing data into separate containers based on configured sites.
15. Implemented Slowly Changing Dimensions (SCD) Type 1.

KafkaDelta LakeAzure Cosmos DBDatabricksAzure Data FactorySQL+3

Data Engineer

Mar 2023 – Jan 2025 · 1 yr 10 mos · Chennai, Tamil Nadu, India · Remote

Client Name: PepsiCo
Project Nature: Medallion Data Architecture Pipeline
1. Designed a medallion architecture pipeline to ingest data from 38 sites across more than 30 Kafka topics, building Raw, Bronze, Silver, and Gold layer tables for stakeholders.
2. Configured pipelines to run daily or at multiple times a day, based on stakeholder requirements, by adjusting pipeline schedules.
3. Optimized pipelines by using partitioning, predicate pushdown, and project techniques, reducing runtime by 90% and costs by 97%.
4. Implemented Slowly Changing Dimensions (SCD) Type 1 and Type 2 based on table data.
5. Implemented a job-level parallelism using the spark. scheduler.pool
Project Nature: Kafka and File-Based Data Processing for Reporting and Analytics
1. Ingested data from Kafka topics into Delta tables using Spark Structured Streaming with AvailableNow as a batch job.
2. Ingested data from various file formats (CSV, JSON, and Parquet) using Autoloader in Databricks.
3. Applied medallion architecture to create raw, bronze, silver, and gold layers for data processing.
4. Implemented log analytics to capture pipeline logs for debugging and future analysis.
5. Applied liquid clustering to all layer tables.
6. Scheduled pipelines with Azure Data Factory (ADF) and wrote data to Synapse for Power BI reporting and SQL Server for web application integration.

KafkaDelta LakeAzure Data FactorySparkDatabricksData Engineering+1

Gep worldwide

Data Engineer

Oct 2021 – Mar 2023 · 1 yr 5 mos · Mumbai, Maharashtra, India

Client Name: Shell, HP, Chevron, BOA
Project Nature: ETL and Reporting System
The Data Platform Project is designed to collect both structured and semi-structured data from various sources and store it in Azure Data Lake Storage (ADLS) Gen2. Once the raw data is collected from different source systems, ETL pipelines are run to cleanse the data and transform it according to business requirements. The processed data is then utilized for reporting, dashboards, APIs, SFTP, and data visualization to meet internal and client needs.
Data Ingestion and Transformation:
1. Historical data was ingested from Elastic Search and streaming data from Apache Kafka into ADLS Gen2 in Delta format using Spark Structured Streaming APIs through Databricks Jobs.
2. A data transformation pipeline was implemented to generate flattened tables by joining various fact and dimension tables, which were subsequently used for report generation, API access, and CSV file creation (for sharing via SFTP).
3. A Date Dimension and measures were created to enable more effective analysis of business performance across different time periods.
4. Logging was implemented using Azure Data Factory (ADF) to track the execution history of all pipeline run instances.
5. A Power BI dashboard was developed to monitor and visualize the logs of the data ingestion and ETL pipeline processes.

Azure Data LakeApache KafkaPower BIAzure Data FactorySparkData Engineering+1

Tata consultancy services

System Engineer

Jun 2018 – Oct 2021 · 3 yrs 4 mos · Mumbai, Maharashtra, India

Client Name: Avis Budget and GSK
Project Nature: ML Image Metadata and Thumbnail Automation
I worked to industrialize the ML pipeline developed by data scientists and automate manual processes. This pipeline processes images scanned and uploaded by the scientists on remote drives, generating thumbnails and image metadata, which are then stored in Hive for consumption by the Spotfire Dashboard.
1. Developed a portable data pipeline that adapts to different environments (TEST/DEV/PROD).
2. Automated the execution of the data pipeline using Airflow DAGs with SSH Hook and Bash Operator.
3. Automated the identification of new device IDs to pass as parameters in the data pipeline.
4. Developed a solution to store the generated thumbnails in a file share location and the associated metadata in Hive tables.
5. Implemented a solution to write logs into log files and send the log files as email attachments.
6. Implemented an Airflow DAG to refresh thumbnails daily from the S3 object store.