Sapna Jain

CTO

Noida, Uttar Pradesh, India · 19 yrs 1 mo experience

Most Likely To Switch · Highly Stable

Key Highlights

17 years of experience in data analytics and big data systems.
Expert in designing massive data platforms.
Proven track record in query optimization and distributed systems.

Stackforce AI infers this person is a Backend-heavy Infrastructure Engineer with extensive experience in data processing and optimization.

Skills

Core Skills

Query Optimization · Distributed Systems

Other Skills

MapReduce · SQL · Data Processing · Job Scheduling · Caching · Data Structures · Algorithms · C++ · Scalability · Hadoop · Big Data · Databases · Programming · Hive · Cloud Computing

About

I enjoy designing and implementing massive data platforms. I have over 17 years of experience in data analytics and big data systems.

Experience

19 yrs 1 mo

Total Experience

2 yrs 5 mos

Average Tenure

5 yrs

Current Experience

Microsoft

Principal Software Engineering Manager

May 2021 – Present · 5 yrs · Noida, Uttar Pradesh, India · On-site

Adobe

Senior Computer Scientist - 2

Apr 2019 – May 2021 · 2 yrs 1 mo · Noida, Uttar Pradesh, India

Sumo Logic

Principal Software Engineer

Jan 2018 – Jan 2019 · 1 yr · Noida, Uttar Pradesh, India

Drubik

Co-Founder

Dec 2016 – Dec 2017 · 1 yr

Sumo Logic

Engineering Manager

Sep 2013 – Nov 2016 · 3 yrs 2 mos · Noida, Uttar Pradesh, India

Microsoft Research

Research Intern

Jun 2012 – Sep 2012 · 3 mos · Greater Seattle Area

SCOPE is a declarative programming language used for analytical processing in Microsoft's Online Services Division. It processes petabytes of data on a cluster of tens of thousands of machines. I worked on the SCOPE query optimizer and enhanced it to address the lack of accurate data statistics, using continuous query optimization and processing. We achieved an order-of-magnitude speed-up on queries from a real workload on a production cluster.

Indian Institute of Technology, Bombay

Graduate student

Jan 2011 – Dec 2013 · 2 yrs 11 mos

I pursued research in query optimization for massively parallel data processing (MapReduce-based systems). The title of my research was "Query optimization for massively parallel data processing".
We developed a query optimizer, based on the Cascades framework, that generates parallel execution plans. The optimizer is independent of the runtime and can be integrated with an existing runtime with little work. We integrated it with the Derby SQL parser and with Hyracks, with Hive on top. With the Hyracks integration, we can optimize a Hive query using our cost-based optimizer and run it on the Hyracks platform. We are working on a framework that allows a more efficient search of the optimizer's search space.

Bing Search (Microsoft Corporation)

Senior Software Development Engineer

Oct 2006 – Dec 2010 · 4 yrs 2 mos · Greater Seattle Area

I worked as a developer with the Bing Infrastructure team for four years. The team develops and manages "Cosmos", a distributed storage and processing engine that runs on a large-scale data cluster of shared-nothing commodity servers. The engine is used for offline batch processing of query logs and web data. I worked on various parts of the engine; the most important projects are listed below:
Query optimization: I worked with the "SCOPE" optimizer team on several projects, including optimizations for the parallel join and union-all operators.
Converting optimizer physical algebra into an execution graph: Our team used "Dryad" as the distributed processing engine. The project involved converting the physical algebra produced by the optimizer into an execution graph in which each vertex runs one or more physical operators. The main challenge was to identify the optimal grouping of operators within a single execution vertex.
Job Scheduler: Designed and developed a multi-queue job scheduler for Cosmos. The main challenge was to maintain isolation between queues while still supporting legacy inter-queue operations.
Distributed Cache: Designed and developed a distributed caching service. In a distributed execution environment, when a resource is required by many processes on many machines, the caching service eliminates request storms on the hosts that hold the original resource data, reducing network load and latency. This makes it a critical component of the execution pipeline for scaling the system.
Distributed data flow graph language: Worked on defining an XML-based language to represent an abstract distributed data flow graph (DDFG). The main challenge was to come up with a simple yet powerful and extensible language for representing an abstract DDFG. I also designed and developed a compiler for the language, which generates a Dryad runtime graph from the XML graph and executes it.