Niraj Bhandarwar

Co-Founder

Delhi, India

1 yr 4 mos experience

Most Likely To Switch

AI ML Practitioner

Key Highlights

Expert in AI and Generative AI technologies.
Proven track record in developing LLM evaluation frameworks.
Strong background in quantitative finance and machine learning.

Stackforce AI infers this person is a skilled AI Engineer with expertise in developing scalable AI systems for the tech industry.

Skills

Core Skills

Artificial Intelligence (AI), Machine Learning

Other Skills

Amazon Web Services (AWS), Deep Learning, Docker, Generative AI, Git, Google Cloud Platform (GCP), Knowledge Graphs, Large Language Models (LLM), Natural Language Processing (NLP), Optimization Algorithms, Quantitative Finance, Reinforcement Learning

About

I'm an AI Engineer trained at IIT Delhi, with a background in GenAI, agentic systems, and quantitative finance. At Scale AI, I contributed to rubric design and agent task annotation for real-world GitHub repositories, enhancing evaluation rigor for Claude-based LLMs. I've also worked on RL-driven trading strategies, RAG pipelines, and LLM reliability across developer workflows. I thrive in fast-paced environments where research meets execution. My toolkit includes Python, PyTorch, MLflow, Kubernetes, Airflow, and cloud platforms like AWS and GCP. I'm seeking full-time roles where I can help build reliable, scalable AI systems, particularly in GenAI, applied ML, or quant-driven startups.

Experience

1 yr 4 mos

Total Experience

8 mos

Average Tenure

1 yr

Current Experience

Stealth AI Startup

Founding AI Engineer

May 2025 – Present · 1 yr · New York, United States · Remote

Building AI Products.

Scale AI

AI Engineer

Jan 2025 – May 2025 · 4 mos · San Francisco, California, United States · Remote

Ballerina Capuchina – LLM Rubric Development
Led the creation of 30+ rubric-based evaluation tasks for LLMs as part of the Ballerina Capuchina project, targeting real GitHub repositories like pandas, DVC, and mitmproxy.
Ensured each task met the 60%+ rubric-failure benchmark for the Claude 3.7 baseline, increasing evaluation rigor and model differentiation.
Authored 10–20 atomic, objective, and self-contained rubric items per task, with clear mapping to prompt goals and critical vs. non-critical classifications.
Hyperion Augmentation – SWE Agent Task Annotator & Reviewer
Generated and refined over 250 software engineering problem statements and requirements with high specificity to guide LLMs in solving real GitHub PR issues.
Validated tests across more than 500 unit test logs (fail-to-pass and pass-to-pass, F2P/P2P) and ensured JSON accuracy, improving test coverage and agent reliability.
Documented public interfaces (functions/classes) from golden patches across multiple languages, enhancing modularity, clarity, and LLM performance in code tasks.