Yash Datta

Co-Founder

Singapore · 16 yrs 6 mos experience

Key Highlights

  • Expert in building scalable data pipelines and machine learning systems.
  • Proven track record in automating testing and enhancing software quality.
  • Strong experience in architecting complex software solutions.

Skills

Core Skills

Machine Learning · Python · Scala · Data Pipelines · Data Modeling · Software Engineering · Recommender Systems

Other Skills

AMPS · Algorithms · Analytics · Apache Camel · Apache Kafka · Apache Parquet · Apache Spark · Apache Spark Streaming · Apache Zeppelin · Big Data · Blazegraph · C (Programming Language) · Databases · Debugging · Distributed Systems

About

Experienced developer, well-versed in creating and productionizing complex software systems, leading by empathy, aiming for excellence. I am particularly good at writing foundational layers of code and common libraries that are modular, maintainable (testable), and easy to build upon (readable). I am also good at organizing code structure and documenting features and functionalities for fast adoption.

Experience

Highflame

Founding AI Engineer

May 2025 – Present · 10 mos · Remote

Machine Learning · Python · PyTorch · Large Language Models (LLM)

Bank of America Merrill Lynch

VP (SSE III)

Sep 2022 – May 2025 · 2 yrs 8 mos · Singapore · Hybrid

  • Worked on complex entity interdependency logic to calculate how changes to an entity affect other interlinked entities in an RDF graph data model, stored in BlazeGraph, as part of the Reference Data Team.
  • Created a generic adapter to ingest, transform, and store messages from AMPS into a SQL database. The adapter is highly configurable and written in Scala using purely functional libraries such as fs2 and cats.
  • Created a data transformation and consumption pipeline using the Apache Camel framework that consumes messages from AMPS and stores the information in a SQL database.
  • Led the enhancement of Jet, a post-trade processing service: automated regression testing, improved build pipeline stability, and eliminated flaky tests. Created an automated regression testing framework for testing complex flows of an event-sourcing-based system; the primary challenge was adding a module for sending and receiving AMPS messages via the Gatling DSL.
Scala · Python · Apache Camel · Data Pipelines · AMPS · Gatling +2

Google Summer of Code

Software Engineer

Jun 2020 – Aug 2020 · 2 mos

  • Architected and developed a big data solution to load telescope alert data into JanusGraph at scale, as part of a Google Summer of Code project at AstroLab Software (HSF-CERN). The solution consists of a Spark job that reads the alert data, generates edges based on vertex classifier algorithms, and loads the vertices and edges into JanusGraph, an open-source graph database.
Data Modeling

Standard Chartered Bank

Data Engineer @nexus

Dec 2019 – Sep 2022 · 2 yrs 9 mos · Singapore

  • Worked on several Spark jobs to compute user account and transaction data for Nexus. The jobs are scheduled via Airflow and run within a k8s cluster.
  • Led the test automation effort within the Nexus team at SCB. Developed an automated integration and regression testing framework for all microservices within Nexus in a k8s cluster. Also helped establish standards and practices for development and testing cycles, evidencing and tracking, and mocking/stubbing of requests. Beyond functional testing, also handled load and performance testing of the complete Nexus system.
  • Established WireMock as the framework for mocking external services within Nexus, making the system easier to test.
  • Created a functional and load testing framework called Juggernaut, wrapping the Gatling library, used to test all the different microservices within the Nexus ecosystem at SCB. The tool is written in Scala, built with Gradle, and run via Jenkins. It is highly configurable and uses data from JSON/CSV files to fire requests at different services, then pushes the Gatling stats to Elasticsearch for easy dashboarding in Kibana. The framework has been extended to match requests and responses against expected data, making it possible to write functional tests as well. It also generates a summary report of all the simulations run, along with the customary pass/fail information.
  • Designed a centralized logging system (Elasticsearch, Fluentd, Fluent Bit, S3) that gathers log data from the k8s pods running the microservices.
Data Pipelines · Data Modeling

Rakuten Asia Pte. Ltd.

2 roles

Senior Software Engineer

Sep 2017 – Dec 2019 · 2 yrs 3 mos · Singapore

  • Contributed significantly to architecting and developing an ad tracking solution for generating analytical reports used to measure ad performance. The solution involved ingesting large amounts of data in real time from Kafka using Spark Streaming and writing the transformed data into HDFS. Developed standard ETL flows for rule-based fraud/filter detection using Spark; conversion reports are then generated using Hive queries.
  • Introduced Prometheus as the tool of choice for monitoring web APIs within Rakuten MPD. Helped with deployment and created a Scala wrapper for easily integrating Prometheus into any Scala codebase.
  • Architected and led development of the Behavioral Targeting Advertisement system within MPD, recently deployed to production with no major issues reported. This was a complex, large-scale system that involved interfacing with multiple external systems, communicating with all stakeholders, breaking work down into actionable tasks, and overcoming multiple technical challenges.
  • Developed a large-scale ETL project, "Curator", that processes ~8 TB of data in about 1 hour 40 minutes using Apache Spark.
  • Developed a project that scales with data, processing and storing it to Elasticsearch using Spark (5X latency improvement while handling 2X more data than the existing system).
  • Proposed the architecture for "Easel", a scoring platform service for recommending advertisements to users based on their behavioral data. Easel is now deployed to production on Kubernetes.
  • Worked on a POC for streaming logs from a low-latency ad delivery service to Kafka, instead of writing them to disk and then using Fluentd to move them to a central server.
  • Active member of Architect's Connect, a forum for discussing and developing new ideas and improvements to existing systems within the Rakuten ecosystem.
  • Led the initiative to establish better, more efficient QA practices within the team.
Apache Spark Streaming · Data Pipelines · Recommender Systems · Data Modeling

Senior Software Engineer

Sep 2017 – Dec 2019 · 2 yrs 3 mos · Singapore

  • Conducted a hands-on technical workshop introducing Apache Spark for data processing use cases, covering the Spark APIs and basic concepts such as shuffle, how data is distributed, and Spark Streaming (November 2019).
  • Added several common utility libraries as reusable components across different projects, including a Git config loading and caching utility, a Scala wrapper over Caffeine cache, and a Prometheus metrics library for Scala.
Data Pipelines · Data Modeling

Agoda

Senior Software Engineer

Jun 2016 – Aug 2017 · 1 yr 2 mos · Bangkok

  • Responsible for developing and scaling Agoda's city search flow. Involved in re-architecting, building and deploying the new search API.
  • Created a generic framework for handling all the different filters, along with any complex AND/OR combinations among them.
  • Introduced Elasticsearch for solving several API use cases.
  • Reduced latency by up to 3X from the old architecture.
  • Developed a highly configurable, general-purpose akka-http-based REST client, used for calling multiple different APIs in the Agoda ecosystem.
  • Developed a distributed sync service to sync data to Elasticsearch.
  • Involved in many different POCs for further improving latency and search capabilities.
Data Modeling

Guavus

3 roles

Technology Lead

Apr 2015 – Jun 2016 · 1 yr 2 mos · Gurgaon, Haryana, India

  • CareReflex 2.0 rewrite using Apache Phoenix over HBase.
  • Evaluated the tech stack for the next generation of the CareReflex product (Impala/Kudu/HBase/Phoenix).
  • Continued contributions and optimizations to Apache Spark SQL (1.4, 1.5, 1.6).
  • Bug fixes and optimizations in the Apache Parquet project.
  • Optimized very low-latency Spark queries for Acume Cache, a caching layer built on top of Spark. Acume serves time series and aggregate queries (on data indexed by subscriber ID) in under 500 ms, optimized for a load of 25 queries per second.
Data Modeling

Senior Software Engineer

Oct 2012 – Mar 2015 · 2 yrs 5 mos · Gurgaon, Haryana, India

  • Key role as a platform developer for the data storage layer.
  • Several optimizations to Spark SQL (1.2, 1.3) and Parquet (1.6) for faster read queries.
  • Bug fixing in the spark-sql and parquet-mr projects.
  • Optimization of specific queries in Spark SQL (1.1.1).
  • Integrated Impala into the Guavus platform (CentOS box).
  • Active role in integrating Shark into the Guavus platform; provided support for running the Shark server and connecting to it via Beeline.
  • Added a custom storage handler for InfiniDB (a columnar datastore) in Hive. The functionality allows data from a Hive table to be stored in an external table backed by InfiniDB; queries can combine data from native Hive tables and these external tables.
  • Added bin replay functionality, where data from past timestamps needs to be persisted.
  • Stabilized and performance-tuned Insta; optimized large aggregation queries with the same tuple list.
  • New backup and restore strategy for InfiniDB.
  • Restructured the Query Engine to handle generic cases (solution-independent architecture). QE now uses a Java service to spawn MapReduce jobs for record filtering based on key columns defined in a configuration XML. Developed MR jobs for different source data for QE.
  • Active bug fixing.
Data Pipelines · Data Modeling

Software Engineer

Dec 2011 – Sep 2012 · 9 mos · Gurgaon, Haryana, India

  • Worked as a platform developer for Guavus's data analytics pipeline. Project work included:
  • Development of features for Insta, the efficient data storage and retrieval service (structured live big data). Insta uses InfiniDB (a column-oriented, scalable database) as its primary storage engine and is written in C++.
  • Development of the Query Engine, a forensic query and analysis service. QE uses HDFS and a MySQL column store as storage sections. MapReduce jobs are run to pull records from HDFS into InfiniDB after annotation.
  • Writing test scripts for QE in Python.
  • Automating InfiniDB installation via the Tall Maple CLI (Tall Maple is a custom Linux kernel).
Data Pipelines

Aristocrat

Software Engineer

Apr 2011 – Dec 2011 · 8 mos · Noida Area, India

  • Worked as a games developer (casino slot machine games) for Aristocrat Technologies.

Gemalto

R&D Software Engineer

Jul 2009 – Apr 2011 · 1 yr 9 mos · Singapore

  • Part of the team at the Research and Development Center, Singapore, which develops the OS for SIM cards in mobile phones. The team targets the low-cost SIM card markets of India and China, which present a constant challenge: delivering more applications and functionality on limited, low-end hardware.
  • I worked on:
  • Development of test scripts for exhaustive testing of applications running on the OS.
  • Code size and RAM optimization of the application on the SIM card (Samsung Calmshine16 V2 compiler).
  • Porting of test scripts from the native to the .NET platform.
  • Hardware testing of the product as a whole, to ensure it conforms to the GSM standards for voltage, current consumption, and noise.

Education

Columbia University

Master's degree — Computer Science

Jan 2021 – May 2024

Netaji Subhas Institute of Technology

BE — Instrumentation and Control

Jan 2005 – Jan 2009

Apeejay School

Computer Science
