Divya Sirala

CTO

Gurugram, Haryana, India · 3 yrs 8 mos experience

Most Likely To Switch · AI ML Practitioner

Key Highlights

Expert in building production-grade LLM systems.
Proficient in RAG architecture and benchmarking.
Strong leadership in AI/ML project execution.

Stackforce AI infers this person is a GenAI and AI Benchmarking specialist in the AI industry.

Skills

Core Skills

GenAI, Agentic AI Development, RAG Architecture, Software Development

Other Skills

LangGraph, Scrum, Retrieval-Augmented Generation (RAG), Artificial Intelligence (AI), Prompt Design, Confluence, Problem Solving, Team Leadership, Team Management, Agile Project Management, Prompt Evaluation, LoRA, Multimodal AI, Hugging Face Transformers, OpenAI API

About

GenAI / Agentic AI Engineer with 4+ years of experience building production-grade LLM systems focused on reliability, benchmarking, and RAG architecture. I design evaluation-driven GenAI systems using LangGraph for multi-agent orchestration and LangChain for structured RAG pipelines, where prompts are versioned, agents are testable, and model performance is measured through automated benchmarking frameworks (TerminalBench-style). My work emphasizes reasoning evaluation, regression testing, determinism, and failure-mode analysis. I build enterprise-grade RAG systems with optimized chunking, retrieval quality tuning, grounding strategies, and hallucination control. Seeking senior GenAI / Agentic AI roles where production rigor, observability, and reliability are core system requirements.

Experience

3 yrs 8 mos

Total Experience

1 yr 2 mos

Average Tenure

1 yr 4 mos

Current Experience

Turing

2 roles

AI/ML Pod Lead

Promoted

Sep 2025 – Present · 9 mos · Remote

TBench 1.0: Prompt Evaluation & Observability
Architected a LangSmith-style prompt evaluation pipeline, treating prompts as versioned, testable artifacts with controlled execution (frozen system prompts, low temperature, fixed tools). Led a pod of 6 engineers to curate task suites designed to expose reasoning failures, instruction-following gaps, and robustness issues. Evaluated GPT-5 and Claude Sonnet using traced runs and task-level scoring to identify prompt regressions, model weaknesses, and reliability tradeoffs, enabling data-driven prompt and model selection.

LangGraph, Scrum, GenAI, Agentic AI Development

AI/ML Engineer - GenAI & RAG - AI Benchmarking - Prompt Engineering

Jan 2025 – Present · 1 yr 5 mos · Remote

TBench 2.0: Log-Driven Model Evaluation & Reliability
Designed prompts to stress-test and break model behavior across complex reasoning and edge cases. Built a log-driven model evaluation pipeline benchmarking GPT-Codex and Claude Sonnet against golden expectations. Analyzed execution logs to compare correctness, consistency, error handling, and failure patterns, surfacing subtle behavioral differences not visible in single-run testing and aligning with production-grade LLMOps practices.
Linux Environment LLM Benchmarking
Designed task prompts providing sufficient operational context for LLMs to solve terminal-based tasks in a Linux environment using only an initial instruction. Implemented a log-based evaluation workflow benchmarking GPT-5 and Claude against golden reference solutions. Evaluated task completion, reasoning fidelity, and failure recovery in non-chat, constrained execution environments to assess model suitability for system- and tool-oriented workloads.
RLHF, SFT & Chain-of-Thought Benchmarking
Worked on RLHF-, SFT-, and Chain-of-Thought–based benchmarking tasks to evaluate alignment, instruction adherence, and reasoning stability. Compared pre- and post-alignment behavior, focusing on reducing silent failures, improving consistency, and stabilizing long-context reasoning. Applied findings to guide prompt design and evaluation strategies in applied GenAI systems.

Retrieval-Augmented Generation (RAG), Artificial Intelligence (AI), GenAI, RAG Architecture

QBS Learning

Data Scientist - GenAI & Agentic AI

Sep 2024 – Jan 2025 · 4 mos · Noida, Uttar Pradesh, India

Built and evaluated GenAI-powered solutions with agentic AI workflows for automation and intelligent decision-making.
Developed LLM-based prototypes (RAG, conversational AI) for education and training applications.
Collaborated with cross-functional teams to transform research concepts into scalable data science solutions.

Retrieval-Augmented Generation (RAG), Artificial Intelligence (AI), GenAI, Agentic AI Development

Outlier

AI Trainer - Prompt Engineering & LLM Optimization - AI Benchmarking

Jan 2023 – Jan 2025 · 2 yrs · India · Remote

Trained and optimized LLMs using SFT, RLHF, CoT, and multimodal prompting.
Designed and reviewed datasets to enhance reasoning, accuracy, and alignment.

Artificial Intelligence (AI), Prompt Design, GenAI, Agentic AI Development

Aristocrat Technologies | EMEA

Software Developer

Sep 2022 – Aug 2024 · 1 yr 11 mos · Gurugram, Haryana, India

Developed and optimized C++ applications with a focus on performance, multithreading, and debugging (GDB).
Contributed to game development systems on Linux, ensuring high reliability and scalability.
Collaborated in an Agile/SCRUM environment, streamlining development workflows with Git, JIRA, and CI/CD tools.

Confluence, Problem Solving, Software Development