Monu Kumar — Data Engineer

🚀 Data Engineer | Big Data | Spark | Scala | PySpark | Cloud (AWS, GCP, Kubernetes) | Performance Optimization

I am a results-driven Data Engineer with expertise in Big Data processing, ETL pipeline development, data modeling, and cloud computing. I specialize in building scalable, high-performance data solutions, optimizing Spark workloads, and driving cost-efficient cloud architectures. With hands-on experience in Scala, PySpark, Apache Airflow, Kubernetes, and distributed computing, I thrive on solving complex data challenges.

🔹 Key Expertise & Achievements:

✔ Claims Data Processing (5TB+ Daily) – Developed PySpark validation workflows for schema enforcement, duplicate detection, and format validation, improving data accuracy to 98% and reducing anomalies by 25%.
✔ Big Data Transformation – Built Scala-based Spark jobs to process hundreds of millions of claim records monthly, applying deduplication, enrichment, and aggregations, improving processing efficiency and reducing computation time.
✔ Cloud Data Modeling & Optimization (500TB+ on BigQuery) – Designed partitioned and clustered data models, leading to a 20% boost in query performance for analytics and reporting.
✔ OpenStreetMap (OSM) Data Processing – Built a Spark-Scala pipeline to convert OSM data to Apple Maps format, optimizing schema mapping, transformations, and aggregations, reducing conversion time by 87%.
✔ Performance Tuning & Cost Optimization – Optimized Spark jobs on AWS EMR and Kubernetes, tuning memory settings, parallelism, and shuffle operations, reducing job execution time by 50% and cutting cloud costs by 35%.
✔ ETL & Workflow Automation – Migrated Ab Initio jobs to Spark (Java/PySpark), optimizing shuffle operations, caching strategies, and parallel execution, reducing ETL execution time by 18%.
✔ Data Storage & Processing Efficiency – Converted EBCDIC to Parquet, optimizing storage footprint and query performance, leading to significant cost savings.
✔ Kubernetes-Based Spark Execution – Implemented dynamic resource allocation and optimized driver/executor memory, reducing job submission time by 10% and improving cluster utilization by 7%.
✔ Scalable Data Pipelines – Automated ETL pipelines using Apache Airflow, reducing manual intervention by 40% and improving workflow reliability by 20%.

I am always looking to enhance data architectures, drive performance improvements, and optimize large-scale data processing workflows. Let’s connect.

Stackforce AI infers this person is a Data Engineering expert in Healthcare and Geospatial industries, specializing in Big Data solutions.

Location: Hyderabad, Telangana, India

Experience: 5 yrs 3 mos

Skills

Big Data Processing
ETL Development
Data Modeling
Data Engineering
Data Processing

Career Highlights

Expert in building scalable data solutions.
Achieved 98% data accuracy in claims processing.
Reduced cloud costs by 35% through optimization.

Work Experience

apree health

Data Engineer (2 yrs 3 mos)

Apple

Data Engineer via Unify (1 yr 4 mos)

Mphasis

Software Developer (Data Engineer) (1 yr 8 mos)

Education

Bachelor of Technology - BTech at Bharati Vidyapeeth's College of Engineering, Pune

Other Skills

AWS EMR
AWS Glue
AWS Lambda
Amazon EC2
Amazon Elastic MapReduce (EMR)
Apache Airflow
Apache Kafka
BigQuery
Data Validation
Docker
EBCDIC
ETL
Hadoop
Java
Jenkins

Experience

Total Experience: 5 yrs 3 mos
Average Tenure: 1 yr 9 mos
Current Experience: 2 yrs 3 mos

apree health

Data Engineer

Feb 2024 – Present · 2 yrs 3 mos · Hyderabad, Telangana, India · Hybrid

Developed PySpark-based validation workflows for 5TB+ of daily claims data, covering schema enforcement, duplicate detection, and format validation, reducing data anomalies by 25% and improving accuracy to 98%.
Built Scala-based Spark jobs to process hundreds of millions of claim records monthly, applying deduplication, enrichment, and aggregations, optimizing processing efficiency and reducing computation time.
Designed optimized BigQuery data models for 500TB+ of claims data, leveraging partitioning and clustering to enhance query performance by 20% for reporting and analytics.
Automated ETL pipelines using Apache Airflow, reducing manual intervention by 40%, improving pipeline reliability by 20%, and integrating monitoring and alerting for seamless execution.

Skills: PySpark, Scala, BigQuery, Apache Airflow, Big Data Processing, ETL Development

Apple

Data Engineer via Unify

Oct 2022 – Feb 2024 · 1 yr 4 mos · Hyderabad, Telangana, India · Hybrid

Client: Apple Inc.
Role: Data Engineer, Apple Maps Data Ingestion Team
Ingested and processed OpenStreetMap (OSM) data, applying cleaning, validation, and quality checks for missing values, duplicates, and formatting errors, improving data accuracy by 20%.
Built a Spark-based data pipeline using Scala to convert OSM data into Apple Maps format, implementing data modeling, schema mapping, transformations, and aggregations, reducing conversion time by 87%.
Deployed and fine-tuned the data pipeline for efficient and scalable processing, improving overall speed and lowering resource usage by 25%, enhancing cost efficiency.
Optimized Spark jobs on AWS EMR, refining memory settings, partitioning, and shuffle operations, cutting job completion time by 50% and reducing cloud costs by 35%.

Skills: Scala, Spark, AWS EMR, Data Engineering, Data Processing

Mphasis

Software Developer (Data Engineer)

Feb 2021 – Oct 2022 · 1 yr 8 mos · Pune, Maharashtra, India

Converted EBCDIC-encoded data to Parquet, reducing the storage footprint and improving query performance, which led to cost savings.
Migrated Ab Initio jobs to Spark (Java and PySpark), streamlining ETL workflows and reducing execution time by 18% by optimizing shuffle operations, caching strategies, and parallelism.
Optimized Spark job execution on Kubernetes, tuning driver/executor memory, dynamic resource allocation, and parallel processing, reducing job submission time by 10% and improving cluster utilization by 7%.

Skills: EBCDIC, Parquet, Spark, Data Engineering, ETL Development