Viraj Thakur - Software Engineer
Data Platform Engineer | BharatGen | Building AI for Bharat
As a co-lead Data Platform Engineer at BharatGen, a semi-private, government-funded initiative, I contribute to India's mission of building foundational AI models, including Large Language Models (LLMs), Vision-Language Models (VLMs), and speech systems, tailored to Indian languages and domains. I architect and manage end-to-end data pipelines that power generative AI training at scale. My focus spans multilingual data acquisition, curation, synthetic augmentation, tokenization, and infrastructure observability. Our data work spans 16 Indian languages and diverse domains such as healthcare, agriculture, law, and governance.
Key Responsibilities
- Data Curation & Augmentation: Lead scraping, filtering, and OCR-based digitization of Indian web content and documents. Designed persona-driven synthetic data pipelines for scalable instruction-tuning data generation, enabling trillion-scale datasets using LLMs and prompt engineering.
- Training Readiness: Built robust preprocessing pipelines for transforming raw data into training-ready, tokenized formats. Minimized manual overhead through automation and reproducibility-focused tooling.
- Compute & Infra Monitoring: Monitor large-scale GPU training clusters using Prometheus and Grafana. Developed automated storage tracking dashboards to optimize petabyte-scale distributed data systems.
- Benchmarking & Research: Contributed to SFT and checkpoint evaluations; collaborated on novel research for scaling synthetic data, with a publication underway.
- Mentorship & Enablement: Mentored interns on setting up OCR engines and metadata-based scraping. Contributed to internal best practices for scalable data workflows and tooling.
Technical Stack
- Languages: Python, SQL, Bash
- Big Data: Spark, Hive, Hadoop, HDFS
- Pipelines: Airflow, Kafka
- Monitoring: Prometheus, Grafana
- Infra: DataHub, JuicyFS (in-house Data Lake)
- Cloud: AWS, GCP
- LLM Workflows: Tokenization, Synthetic Prompting, Evaluation
What Drives Me
I'm driven by the depth and diversity of work, from LLM data prep to high-performance compute systems, all aimed at shaping India's AI future. Working with deep learning researchers and large-scale systems has been both a dream and a growth journey. My next goal is to build a centralized observability platform offering real-time visibility into our data and compute stack, driving transparency, accountability, and performance optimization at scale.
Stackforce AI infers this person is a Data Engineering expert in the AI and Finance sectors.
Location: Mumbai, Maharashtra, India
Experience: 3 yrs 9 mos
Skills
- Data Engineering
- AI Infrastructure
- Infrastructure Monitoring
- ETL Processes
- Data Migration
Career Highlights
- Expert in building scalable AI data pipelines.
- Led successful data migration projects in finance.
- Mentored interns in advanced data engineering techniques.
Work Experience
BharatGen
Data Platform Engineer (1 yr 1 mo)
LTIMindtree
Senior Data Engineer (2 yrs 8 mos)
Education
Bachelor of Engineering - BE at St. Francis Institute of Technology
HSC at Sardar Vallabhbhai Patel Vidyalaya and Jr. College
SSC at Abhinav Vidya Mandir