Monu Kumar - Data Engineer
Data Engineer | Big Data | Spark | Scala | PySpark | Cloud (AWS, GCP, Kubernetes) | Performance Optimization
I am a results-driven Data Engineer with expertise in Big Data processing, ETL pipeline development, data modeling, and cloud computing. I specialize in building scalable, high-performance data solutions, optimizing Spark workloads, and driving cost-efficient cloud architectures. With hands-on experience in Scala, PySpark, Apache Airflow, Kubernetes, and distributed computing, I thrive on solving complex data challenges.
Key Expertise & Achievements:
- Claims Data Processing (5TB+ daily): Developed PySpark validation workflows for schema enforcement, duplicate detection, and format validation, improving data accuracy to 98% and reducing anomalies by 25%.
- Big Data Transformation: Built Scala-based Spark jobs to process hundreds of millions of claim records monthly, applying deduplication, enrichment, and aggregations, improving processing efficiency and reducing computation time.
- Cloud Data Modeling & Optimization (500TB+ on BigQuery): Designed partitioned and clustered data models, leading to a 20% boost in query performance for analytics and reporting.
- OpenStreetMap (OSM) Data Processing: Built a Spark-Scala pipeline to convert OSM data to Apple Maps format, optimizing schema mapping, transformations, and aggregations, reducing conversion time by 87%.
- Performance Tuning & Cost Optimization: Optimized Spark jobs on AWS EMR and Kubernetes, tuning memory settings, parallelism, and shuffle operations, reducing job execution time by 50% and cutting cloud costs by 35%.
- ETL & Workflow Automation: Migrated Ab Initio jobs to Spark (Java/PySpark), optimizing shuffle operations, caching strategies, and parallel execution, reducing ETL execution time by 18%.
- Data Storage & Processing Efficiency: Converted EBCDIC to Parquet, optimizing storage footprint and query performance, leading to significant cost savings.
- Kubernetes-Based Spark Execution: Implemented dynamic resource allocation and optimized driver/executor memory, reducing job submission time by 10% and improving cluster utilization by 7%.
- Scalable Data Pipelines: Automated ETL pipelines using Apache Airflow, reducing manual intervention by 40% and improving workflow reliability by 20%.
I am always looking to enhance data architectures, drive performance improvements, and optimize large-scale data processing workflows. Let's connect.
Stackforce AI infers this person is a Data Engineering expert in Healthcare and Geospatial industries, specializing in Big Data solutions.
Location: Hyderabad, Telangana, India
Experience: 5 yrs 1 mo
Skills
- Big Data Processing
- ETL Development
- Data Modeling
- Data Engineering
- Data Processing
Career Highlights
- Expert in building scalable data solutions.
- Achieved 98% data accuracy in claims processing.
- Reduced cloud costs by 35% through optimization.
Work Experience
apree health
Data Engineer (2 yrs 1 mo)
Apple
Data Engineer via Unify (1 yr 4 mos)
Mphasis
Software Developer (Data Engineer) (1 yr 8 mos)
Education
Bachelor of Technology - BTech at Bharati Vidyapeeth's College of Engineering, Pune