Shubam Sharma

Data Engineer

Pune, India · 4 yrs 8 mos experience

Highly Stable

Key Highlights

Achieved 70% cost savings in ETL processes.
Reached 97% matching accuracy for patient-payer records in healthcare data.
Built scalable geospatial data pipelines across 197 countries.

Stackforce AI infers this person is a Senior Data Engineer specializing in Geospatial and Healthcare data systems.

Skills

Core Skills

Spark, Scala, FastAPI, BigQuery

Other Skills

Java, AWS, Parquet, Protobuf, Kafka, Redis, Cassandra, Kubernetes, S3, OSM, Python, GCP, MSSQL, Docker, GitHub Actions

About

I work on Apple Maps' Geo Data Platform, mostly Spark/Scala pipelines that process geospatial and probe data across 197 countries. The stuff I've been focused on recently: cutting a 10+ hour ETL job down to 4.6 hours (70% cost savings), fixing a conflation bug that was generating 10 million false deltas per cycle, and building a pipeline that uses OSM to refresh airport geometry data that hadn't been touched since 2019.

Before this I was at FIGmd for about 3.5 years working on healthcare data. Built Spark ETL jobs for patient-payer record matching, FastAPI services on GCP Cloud Run for medical record retrieval, and an event-driven reporting system with BigQuery and Pub/Sub.

I like working on systems where getting the data right actually matters - whether that's someone's navigation route or a patient's medical records.

Stack: Spark, Scala, Python, SQL, AWS (EMR/EKS/S3), GCP (BigQuery, Cloud Functions, Dataproc), Kafka, Cassandra, Redis, Parquet, Protobuf.

Open to Senior Data Engineer roles.

Experience

4 yrs 8 mos

Total Experience

3 yrs 10 mos

Average Tenure

10 mos

Current Experience

Apple

Senior Data Engineer

Jul 2025 – Present · 10 mos · Hyderabad · On-site

Working in the Maps org on the Geo Data Platform team. My day to day is Spark/Scala pipelines that handle geospatial data ingestion, conflation, and validation for Apple Maps across 197 countries.
Some things I've worked on:
Optimized a probe data ETL pipeline (~100 TiB/month across 10 countries). Brought runtime down from 10+ hours to 4.6 hours and cut monthly costs by 70% through serialized persistence, S3 checkpointing, and a tiered executor config (200/500/1000 executors based on country volume).
Found and fixed a bounding box bug in building conflation that was creating 10.3 million false deltas per cycle. One-line fix, 97% reduction in noise.
Fixed a turn restriction bug that was incorrectly blocking valid turns on highways in 3 countries. Traced through 500+ lines of Spark transformations to find the root cause in path selection logic.
Built a Spark pipeline to refresh a stale airport geometry resource (untouched since 2019) by extracting from 1.1B+ OSM ways. Expanded coverage from 15.6K to 23K airports across 238 countries.
Debugged and stabilized third-party (3P) address ingestion pipelines across multiple countries during the EMR to EKS migration. Fixed dynamic allocation issues, S3 URI scheme conflicts, and serialization bugs.
Extended the OSM validation framework with 10+ data quality checks and fixed MapRoulette upload logic for per-country projects.
Tech: Spark, Scala, Java, AWS (EMR, EKS, S3), Parquet, Protobuf, Kafka, Redis, Cassandra, Kubernetes
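
The tiered executor config mentioned in the probe ETL item could look roughly like the sketch below. It is an illustration only. The three executor counts (200/500/1000) come from the text above; the volume thresholds, the function names, and the choice of plain Python (the pipeline itself is described as Spark/Scala) are assumptions made for the example.

```python
# Hypothetical volume cut-offs in TiB/month. The profile gives only the
# three executor counts, not the thresholds that separate the tiers.
SMALL_MAX_TIB = 5.0
MEDIUM_MAX_TIB = 20.0


def executors_for_volume(monthly_tib: float) -> int:
    """Pick an executor count tier from a country's monthly data volume."""
    if monthly_tib < 0:
        raise ValueError("volume must be non-negative")
    if monthly_tib <= SMALL_MAX_TIB:
        return 200   # small-volume countries
    if monthly_tib <= MEDIUM_MAX_TIB:
        return 500   # mid-volume countries
    return 1000      # largest countries


def spark_conf_for_country(monthly_tib: float) -> dict:
    """Build the Spark setting that fixes the executor count for one run."""
    return {"spark.executor.instances": str(executors_for_volume(monthly_tib))}
```

A per-country job launcher could merge the returned dictionary into its submit-time configuration, so small countries do not reserve the same cluster footprint as the largest ones.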

FIGmd, Inc.

Senior Software Engineer

Aug 2021 – Jun 2025 · 3 yrs 10 mos · Pune District · Remote

Worked on the data platform team building pipelines and services for healthcare data - clinical records, payer data, and reporting systems. Mix of Spark/Scala for batch processing and Python/FastAPI for APIs.
Built a Spark/Scala ETL pipeline that matched patient records with payer data by pulling from GCS and MSSQL. Got it to 97% matching accuracy with configurable rules per practice, and cut runtime by 40% through partition tuning and predicate pushdown.
Developed FastAPI microservices deployed on GCP Cloud Run for medical record retrieval from Cassandra and GCS. These served clinical workflows with strict SLA requirements.
Built an event-driven reporting system using Cloud Functions, Pub/Sub, and BigQuery. Cloud Scheduler handled orchestration and Log Sinks provided observability. Fully serverless.
Migrated legacy services from Python 3.6 to 3.12. Set up CI/CD pipelines with 88% test coverage. Added structured logging and monitoring that brought MTTR down by 35%.
Tech: Spark, Scala, Python, FastAPI, GCP (Cloud Run, BigQuery, Cloud Functions, Pub/Sub, GCS), Cassandra, MSSQL, Docker, GitHub Actions
