Teja K.

Co-Founder

Palo Alto, California, United States · 3 yrs 9 mos experience

AI Enabled · AI ML Practitioner

Key Highlights

Built AI agent runtime processing 500K+ events/min.
Improved backend reliability, reducing latency by 35%.
Authored a widely-read article on AI production incidents.

Stackforce AI infers this person is a Backend Engineer specializing in AI Infrastructure and Healthcare Technology.

Skills

Core Skills

Distributed Systems · Microservices Architecture · Infrastructure Optimization · Observability · Backend Engineering · Healthcare Technology · Infrastructure Automation · Cloud Computing

Other Skills

Kubernetes · LangGraph · EKS · AI · Cost Management · Telemetry · Python (Programming Language) · Java · Azure OpenAI · LangChain · Spring Boot · MySQL · Helm (Software) · Docker · Terraform

About

I’m a software engineer with 3+ years of experience building backend and distributed systems, focused on reliability, scale, and production infrastructure. I’ve worked on healthcare IT, payment systems, service onboarding platforms, observability pipelines, and fault-tolerant backend services in large-scale environments. My core stack includes Java, Go, Python, AWS, microservices, messaging systems, and production troubleshooting. I’m most interested in the part of engineering where systems move from “it works in a demo” to “it works reliably in production”: back-pressure, retries, failover, instrumentation, resilience, and safe migrations. I also write about engineering failures, production tradeoffs, and lessons from building real systems. A piece I wrote on a $47K production AI incident has been read by 92K+ engineers and cited by industry leaders, which has helped me build strong connections across backend, platform, and AI infrastructure. I’m currently open to roles in backend engineering, distributed systems, platform engineering, and AI infrastructure.

Experience

Total Experience: 3 yrs 9 mos

Average Tenure: 2 yrs 11 mos

Current Experience: 10 mos

Getonstack

Founding Engineer

Aug 2025 – Present · 10 mos

∙Built a multi-tenant AI agent runtime processing 500K+ events/min with sub-100ms policy enforcement latency; used blue-green deployments on EKS to provide zero-downtime releases and instant rollback across production AI workloads
∙Designed and implemented A2A protocol handling for agent task delegation and cost tracking across multi-agent pipelines; built an MCP-compliant tool registry supporting 340+ tools with dynamic routing and per-tenant isolation across 10+ enterprise beta users
∙Engineered a cost control engine with per-agent budget enforcement and hard circuit breakers; reduced beta user AI infrastructure costs by 50-60% on average
∙Built a distributed observability platform with real-time anomaly detection and loop-detection logic across a multi-agent execution environment; reduced incident investigation time by 40% through structured telemetry and cost attribution dashboards
∙Implemented CI/CD pipelines (GitHub Actions) with a full test matrix, lint and type-check automation, and blue-green deployment on EKS, giving zero-downtime releases with validated rollback paths

Kubernetes · LangGraph · Distributed Systems · Microservices Architecture

Tata Consultancy Services

2 roles

Software Engineer

Mar 2022 – Aug 2024 · 2 yrs 5 mos · On-site

∙Designed and shipped a GenAI-powered retrieval assistant for clinical documents (Azure OpenAI + LangChain + secure/VPN data access), improving top-k retrieval accuracy by 40% vs. keyword search; added safety filters, logging, and fallback flows
∙Built and scaled high-volume REST APIs, event-driven workflows, and healthcare data integrations; optimized SQL queries, validation checks, and asynchronous scheduling logic to reduce baseline backend latency by 35%
∙Improved reliability of a Java/Spring Boot data pipeline backed by MySQL/Aurora, Kafka/MSK, and OpenSearch; resolved async bottlenecks and retry failures to reduce p95 latency from ~550ms to 360ms
∙Built CDC-style data workflows using DynamoDB Streams to capture database change events and trigger downstream processing; implemented idempotent retries and reconciliation checks to keep data movement reliable
∙Developed CI-based release guardrails using JMeter + Prometheus, reducing rollback incidents by 30%
∙Led 100+ RCA investigations across containerized Linux systems; authored runbooks and automated postmortems
∙Cut MTTR by 45% by deploying Prometheus + OpenTelemetry across 1000+ containers
∙Contributed to a performance-critical gRPC service handler in C++ to reduce tail latency and improve failover reliability
∙Reduced infra cost by ~$5K/month through ingestion tuning and telemetry retention improvements
∙Received On-the-Spot Award (x2) for backend reliability, automation tooling, and production stability

Python (Programming Language) · Java · Backend Engineering · Healthcare Technology

System Engineer

Sep 2021 – Aug 2024 · 2 yrs 11 mos · On-site

∙Built service onboarding automation for an internal developer platform used by 200+ microservices; implemented a template generator (Spring Boot, Dockerfile, Helm), GitLab API integration for repo creation, Jenkins pipeline auto-generation, and Terraform modules for EKS, RDS, and IAM provisioning; reduced onboarding time from 10–14 days to 4–6 hours
∙Designed and built a centralized distributed lock service in Go backed by Redis (SETNX + TTL) to handle race conditions and duplicate processing in payment flows; implemented exponential backoff, idempotent re-acquisition, and in-memory hot-key caching; load tested to 50K concurrent requests with p95 latency under 2ms
∙Implemented failover automation and DR drill workflows for a multi-region payment system across 3 AWS regions; automated Route53 health checks, secondary region promotion scripts, and DB reconnection flows; achieved RPO <30s and RTO <2min, validated through live DR drills

Kubernetes · Helm (Software) · Infrastructure Automation · Cloud Computing

The Spark Foundation - Empowering Educators Around the World

ML Intern

Oct 2020 – Dec 2020 · 2 mos

∙Built and deployed machine learning models using Python, achieving 95% prediction accuracy on target variables
∙Assessed model performance with evaluation metrics such as MSE and R-squared to guide continuous improvement