Viraj Thakur — Software Engineer

🌟 Data Platform Engineer | BharatGen | Building AI for Bharat

As a co-lead Data Platform Engineer at BharatGen, a semi-private, government-funded initiative, I contribute to India’s mission of building foundational AI models, including Large Language Models (LLMs), Vision-Language Models (VLMs), and speech systems, tailored to Indian languages and domains. I architect and manage end-to-end data pipelines that power generative AI training at scale. My focus spans multilingual data acquisition, curation, synthetic augmentation, tokenization, and infrastructure observability. Our data work spans 16 Indian languages and diverse domains such as healthcare, agriculture, law, and governance.

🚀 Key Responsibilities
Data Curation & Augmentation: Lead scraping, filtering, and OCR-based digitization of Indian web content and documents. Designed persona-driven synthetic data pipelines for scalable instruction-tuning data generation, enabling trillion-scale datasets using LLMs and prompt engineering.
Training Readiness: Built robust preprocessing pipelines for transforming raw data into training-ready, tokenized formats. Minimized manual overhead through automation and reproducibility-focused tooling.
Compute & Infra Monitoring: Monitor large-scale GPU training clusters using Prometheus and Grafana. Developed automated storage tracking dashboards to optimize petabyte-scale distributed data systems.
Benchmarking & Research: Contributed to SFT and checkpoint evaluations; collaborated on novel research for scaling synthetic data, with a publication underway.
Mentorship & Enablement: Mentored interns on setting up OCR engines and metadata-based scraping. Contributed to internal best practices for scalable data workflows and tooling.

⚙️ Technical Stack
Languages: Python, SQL, Bash
Big Data: Spark, Hive, Hadoop, HDFS
Pipelines: Airflow, Kafka
Monitoring: Prometheus, Grafana
Infra: DataHub, JuicyFS (in-house Data Lake)
Cloud: AWS, GCP
LLM Workflows: Tokenization, Synthetic Prompting, Evaluation

💡 What Drives Me
I’m driven by the depth and diversity of work, from LLM data prep to high-performance compute systems, all aimed at shaping India's AI future. Working with deep learning researchers and large-scale systems has been both a dream and a growth journey. My next goal is to build a centralized observability platform offering real-time visibility into our data and compute stack, driving transparency, accountability, and performance optimization at scale.

Stackforce AI infers this person is a Data Engineering expert in AI and Finance sectors.

Location: Mumbai, Maharashtra, India

Experience: 3 yrs 11 mos

Skills

Data Engineering
AI Infrastructure
Infrastructure Monitoring
ETL Processes
Data Migration

Career Highlights

Expert in building scalable AI data pipelines.
Led successful data migration projects in finance.
Mentored interns in advanced data engineering techniques.

Work Experience

BharatGen

Data Platform Engineer (1 yr 3 mos)

LTIMindtree

Senior Data Engineer (2 yrs 8 mos)

Education

Bachelor of Engineering - BE at St. Francis Institute Of Technology

HSC at Sardar Vallabhbhai Patel Vidyalaya and Jr. College

SSC at Abhinav Vidya Mandir

AI Enabled · AI ML Practitioner

Skills

Core Skills

Data Engineering
AI Infrastructure
Infrastructure Monitoring
ETL Processes
Data Migration

Other Skills

AWS
Agile Methodologies
Airflow
Amazon Web Services (AWS)
Apache Impala
Apache Spark
Automic (Software)
Autosys
Azure Data Factory
Azure Data Lake
Azure Databricks
Bash
Big Data
Big Data Analytics
Cloudera

Experience

Total Experience: 3 yrs 11 mos
Average Tenure: 2 yrs 8 mos
Current Experience: 1 yr 3 mos

BharatGen

Data Platform Engineer

Feb 2025 – Present · 1 yr 3 mos · Mumbai, Maharashtra, India · On-site

As a Data Platform Engineer with the BharatGen Team, I work at the intersection of large-scale data engineering and cutting-edge AI research to build scalable, reliable, and efficient data infrastructure for generative AI initiatives in India.
🔹 Key Responsibilities:
Lead efforts in web scraping, data acquisition, and intelligent filtering from diverse online sources, with a focus on transforming raw data into high-quality, training-ready datasets.
Design and maintain custom data curation and generation pipelines, optimized for scale and performance.
Build and manage an in-house data lake solution (Juicy FS) and integrate DataHub for robust data lineage tracking and governance.
Monitor and optimize GPU usage and compute resource allocation to ensure cost-effective and high-throughput model training workflows.
Stay on the frontier of AI by regularly reading and applying insights from state-of-the-art research papers, especially in the domains of LLMs and synthetic data generation.
Contribute to the development of synthetic datasets to supplement scarce or sensitive real-world data, supporting more inclusive and scalable model development.
Implement best practices in data governance, privacy compliance, and infrastructure observability using tools like Prometheus and Grafana.
💡 Working with a multidisciplinary team, my goal is to lay the foundation for India-centric generative AI solutions, supporting the BharatGen mission under the National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), led by the Technology Innovation Hub at IIT Bombay.

Python, SQL, Bash, Spark, Hadoop, HDFS, +10 more

LTIMindtree

Senior Data Engineer

Jun 2022 – Feb 2025 · 2 yrs 8 mos · Mumbai, Maharashtra, India · On-site

Client: Northern Europe's leading bank
Conducted comprehensive data flow analysis and evaluated data fixes, resulting in accurate mapping for customer business requirements.
Developed, enhanced, and provided support for data ingestion and migration processes, ensuring seamless operations.
Collaborated closely with the business and analytics team to gather system requirements, aligning with organizational goals.
Demonstrated expertise in real-time data processing using Apache Spark and Hadoop, leading to efficient and timely data processing.
Utilized Python, SQL, and PySpark to optimize ETL operations, ensuring accuracy and reliability.
Monitored job execution, performed debugging, and resolved bugs to maintain smooth operations.
Reduced Hive ETL processing time from 12 hours to 8 hours, a one-third reduction in runtime.
Co-led the migration of on-premise data from Cloudera CDH 6 to CDP 7 in a Hadoop environment, improving the efficiency and scalability of data operations.
Automated job processes using Autosys (UC4), streamlining operations and enhancing efficiency.