Ishan Gupta

Product Engineer

Rochester, New York, United States · 7 yrs 1 mo experience
AI Enabled · AI ML Practitioner

Key Highlights

  • Achieved 22% lower $/token in AI infrastructure.
  • Delivered 40% p99 latency reduction on GPU-Kubernetes systems.
  • CNCF OSS Contributor with extensive cloud-native experience.
Stackforce AI infers this person is a SaaS and Cloud Computing Infrastructure Engineer with a focus on AI and Distributed Systems.

Skills

Core Skills

Cloud Computing · Distributed Systems

Other Skills

  • APIs - REST, GraphQL, gRPC, SOAP
  • Algorithms
  • Amazon Web Services (AWS)
  • Artificial Intelligence (AI)
  • Astrodynamics
  • Back-End Web Development
  • Big Data Analytics
  • Blockchain
  • C
  • C++
  • CI / CD - Travis CI, Circle CI, Jenkins, GitHub workflows, Gitlab, Flux CD, Argo CD
  • Cognitive Neuroscience
  • Cognitive Psychology
  • Computer Networking - Reverse Proxy (NGINX, Lighttpd, Apache Tomcat)
  • Concurrency

About

AI Infrastructure / LLM Platform Engineer | Delivered 22% lower $/token and 40% p99 latency reduction on GPU-Kubernetes systems | CNCF OSS Contributor

Translating research-grade AI systems into scalable, production-ready distributed infrastructure. Open to Mid-level AI Infrastructure & LLM Platform roles (Senior considered).

I enjoy collaborating closely with research, product, and platform teams to turn experimental models into reliable, production-grade AI systems. I’m an AI Infrastructure & Distributed Systems Engineer with 4+ years of experience designing, scaling, and operating production-grade AI and LLM platforms, focused on building high-performance distributed systems that maximize GPU efficiency and reliability.

Core engineering focus & impact:

  • Built and scaled multi-tenant GPU platforms for LLM and vision model training and inference
  • Designed topology-aware scheduling and GPU bin-packing (MIG, NCCL/RDMA, NUMA) → ~22% lower $/token
  • Optimized PyTorch runtime and CUDA/JAX workloads for large-scale training
  • Provisioned petabyte-scale MLOps infrastructure on Kubernetes
  • Scaled model serving pipelines with KServe/Triton and KEDA/HPA → ~40% lower p99 latency
  • Reduced idle GPU utilization (~27.5%) via rightsizing, quotas, and preemptible pools
  • Built horizontally scalable microservices backed by SQL, NoSQL, graph, and vector databases
  • Implemented end-to-end observability (Prometheus, OpenTelemetry, DCGM) with SLOs and error budgets

Technical specialties:

  • Foundational AI / LLM systems, RAG, and agentic AI for cost-efficient inference
  • Large-scale model training & fine-tuning (7B–70B) using LoRA/QLoRA, FSDP, DeepSpeed
  • Parallel and distributed computing for AI training and inference pipelines
  • High-availability, event-driven backends for AI workloads
  • GPU/TPU infrastructure (A100/H100, DGX, TPU v5p/v5e)

Background:

  • Experience across SaaS, on-prem, and enterprise AI platforms
  • Drove cross-geo engineering initiatives at VMware (Broadcom) on large-scale AI infrastructure
  • CNCF open-source contributor
  • M.S. in Computer Science (AI focus: RAG & Agentic AI)
  • Experience building cloud-native, multitenant SaaS LLM platforms and AI infrastructure for AI labs and enterprise environments.

Past work: NLP for chatbots, computer vision for gaming/SLAM, on-device ML, GIS data science, and chaos engineering/SRE.

In short: I specialize in AI infrastructure, distributed systems, and GPU optimization, turning research-grade LLMs into cost-efficient production deployments.

Experience

Career break

Professional development

Aug 2024 - Present · 1 yr 8 mos · Rochester, New York, United States

  • Pursuing a Master's in Computer Science with a concentration in infrastructure and backends for AI.
  • https://calendar.app.google/61GL4sBxGmXfskGy7

Broadcom

R&D Engineer Software 3

Nov 2023 - Jul 2024 · 8 mos · Bengaluru, Karnataka, India · On-site

  • R&D Engineer working on software for VMware by Broadcom.
  • Part of the Tanzu (TNZ) division, working on the development of a Unified Control Plane used to build, run, and manage modern multi-cloud and hybrid-cloud apps on Kubernetes at scale.
  • Solving problems that cut across multiple VMware software products, mainly Cloud Services Portal (CSP), Aria, Tanzu Application Platform, Tanzu Mission Control & Tanzu Network.
  • Drafted Broadcom’s distribution registry (JFrog Artifactory) migration and access plan for TAP-TMC artifacts hosted in Tanzu Network’s Harbor instance.
  • As a technical lead, designed and developed the orchestration and deployment of Tanzu Application Platform from Tanzu Mission Control Self-Managed (On-Prem) to enhance the enterprise readiness of the offering for large-scale customer adoption. Worked with multiple TLs, cross-geo teams, product managers, and solution engineers to design the product integration.
  • Enabled new features and integrations for the Tanzu Mission Control product, including Tanzu Catalog, Terraform Provider, and Tanzu Build Service, after evaluating direct input from multiple customers.
  • Contributed design inputs on the Tanzu Application Engine (TAE) Platform Engineer (PE) console's authentication, fanout, and cluster onboarding, and on the Developer Portal (DP)'s deployment and orchestration.
  • Learning and developing on KCP to build solutions that extend the control plane for product requirements needing intricate design and a massively multi-tenant platform.
  • https://github.com/kcp-dev/kcp
Deep Learning · Cloud Computing · Distributed Systems · Go (Programming Language) · Machine Learning · Containers - Docker & Kubernetes - GKE, EKS, AKS, K3s, Minikube, Kind, Kubeadm

VMware

2 roles

Member of Technical Staff 3

Aug 2023 - Nov 2023 · 3 mos · Bengaluru, Karnataka, India · On-site

  • Research & Development Engineer for Tanzu Mission Control (A managed Kubernetes SaaS on the VMware Cloud services platform).
  • Multi-Cloud, Hybrid-Cloud, on-premises & edge Kubernetes cluster management.
  • Engineering cluster management components and solutions for the Modern Applications Business Unit (Modern Applications & Management Business Group).
  • Developing & maintaining multiple TMC services & extensions.
  • Delivered TKP Package deployment components for TMC Self-Managed (On-Premises), Continuous Delivery to cluster groups and Bring Your Own Image Registry (BYOIR).
  • Tech lead for package deployment components (Carvel packages, Bitnami Helm releases, GitOps and secret management) for TMC - Self Managed.
  • Implemented EULA APIs in TMC. Also led the design and internal release of TAP-TMC integration Phase 1.
  • Worked on the deployment and post-install orchestration of the Tanzu Application Platform product via Tanzu Mission Control SaaS. Designed and implemented an automated login and credential rotation controller for Tanzu Network via TMC, reducing the time to onboard users.
  • Working on Unified Control Plane (UCP) cluster onboarding and package deployment for provisioning developer portals through platform engineering console.
  • Learning and building on KCP for K8s - https://github.com/kcp-dev/kcp
Deep Learning · Cloud Computing · Distributed Systems · Go (Programming Language) · Machine Learning · Containers - Docker & Kubernetes - GKE, EKS, AKS, K3s, Minikube, Kind, Kubeadm

Member of Technical Staff 2

Nov 2021 - Aug 2023 · 1 yr 9 mos · Bengaluru, Karnataka, India · On-site

  • Developed & maintained TMC services & extensions across:
  • Control plane & agent plane:
      • Data Protection Service (Backup, Recovery & Migration using Velero-based cluster extension for agent)
      • Cluster Inspection Service (Conformance & Compliance using Sonobuoy-based cluster extension for agent)
      • Package Deployment and GitOps Service (using Carvel and Flux CD)
  • Control plane:
      • Notification Service & SLI-based alerts (in-app & outbound using VMware Cloud Services Platform)
      • Usage Metering Service (real-time metrics engine ingesting data from various control plane services including cluster agent service and CSP SKUs)
  • Developing components that extend & program the Kubernetes control plane - Controllers & Operators for Custom Resources
  • Developing fault-tolerant Distributed systems on Kubernetes
  • Contributing as a backend engineer for Chaos Engineering, Distributed caching, API Rate Limiting, Site Reliability Engineering, Load Testing, Platform Engineering & Scale Engineering (Infrastructure + Framework design & development)
  • Contributed to data protection and resiliency features of TMC; added security policy and proxy support for the TMC Terraform provider: github.com/vmware/terraform-provider-tanzu-mission-control
  • Technologies: Distributed Systems, Golang, Gomega, Ginkgo, Gomock, Docker, Kubernetes, AWS, GCP, GKE, EKS, AKS, ELK stack, Argo Workflows, Terraform, LitmusChaos, Prometheus, WaveFront, GitOps, DevOps, Cloud computing, VMware CSP, TMC, vSphere, vSAN, Velero, Sonobuoy, Operators, Controllers, Informers, Jenkins, Redis, PostgreSQL, DB Triggers, Apache Kafka, AWS services - SQS, SNS, RDS, DynamoDB, MSK
Cloud Computing · Distributed Systems · Back-End Web Development

VMware Tanzu

Research & Development Engineer

Nov 2021 - Jul 2024 · 2 yrs 8 mos · Bengaluru, Karnataka, India · On-site

  • Development of Tanzu Mission Control (Tanzu Common Core SaaS Foundation)
  • RnD across cloud-native applications & distributed systems
  • Development around cluster lifecycle management (LCM) & cluster API
  • Contributing to VMware Tanzu OSS projects - Carvel, Sonobuoy, Velero & TMC Terraform provider.
  • Resilience Engineering Integrations for products and projects
  • Contributing to TKP (Tanzu Kubernetes Platform)’s build integration services in TMC (Tanzu Mission Control): package (Carvel) deployment service and extension, Helm (using Flux controllers) deployment service and extension, cluster config service (GitOps), cluster agent service, extensions manager and updater.
  • Technologies: Distributed Systems, Golang, Gomega, Ginkgo, Gomock, Docker, Kubernetes, AWS, GCP, GKE, EKS, AKS, ELK stack, Helm, Flux, Argo Workflows, LitmusChaos, Prometheus, WaveFront, GitOps, DevOps, Cloud computing, VMware CSP, TMC, Carvel, Velero, Sonobuoy, Operators, Controllers, Informers, Jenkins, Redis, PostgreSQL, DB Triggers, Apache Kafka, AWS services - SQS, SNS, RDS, DynamoDB, MSK
  • Backend Engineering, Usage metering, Notification, Chaos Engineering, Distributed caching, API Rate Limiting, Data Protection, Cluster Inspection, Site Reliability Engineering, Scale Engineering, Platform Engineering, Load Testing (Infrastructure & Framework)
Cloud Computing · Distributed Systems · Back-End Web Development

ChaosNative (acquired by Harness Inc.)

2 roles

Software Engineer 1

May 2021 - Oct 2021 · 5 mos · Bengaluru, Karnataka, India · Hybrid

  • Backend engineering for real time monitoring of time series metrics from micro-services, Full stack software development, test automation and DevOps
  • Contributing to OSS cloud native Chaos Engineering framework, LitmusChaos
  • Developing Litmus Portal (Chaos Center), a multi-tenant cross-cluster multi-cloud chaos control plane with On-Prem variant and air-gapped support
  • Improving horizontal scalability and availability of Litmus with performance optimisations (Multi-tenancy, H/A, Caching) for enterprise usage and adoption
  • Building integrations and automations for internal tools
  • Built open observability, analytics and integrations for Litmus 2.0.
  • Designing / building observability, authentication, authorization and enterprise features (Licensing & Payments) for products - CLC SaaS and CLE
  • Chaos Engineering, Site Reliability Engineering, Platform Engineering & Observability
  • Technologies: Distributed Systems, Caching, Load Balancing, Apache Kafka, Redis, Percona, MongoDB, MySQL, Python, Golang, Docker, Kubernetes, AWS, GCP, GKE, EKS, Okteto, Kublr, GitHub Actions, Argo CD, LitmusChaos, Prometheus, Grafana, React.js, visx, TypeScript, Material UI, Cypress, GitOps, DevOps, Cloud computing, Chaos Engineering, Operators, Site Reliability Engineering
Cloud Computing · Distributed Systems · Back-End Web Development

Software Engineer Intern

Feb 2021 - Apr 2021 · 2 mos · Bengaluru, Karnataka, India · Hybrid

  • Full-stack Software Development
  • DevOps engineer
  • Building and providing support for MayaData’s Kubera Chaos
  • Development of OSS chaos engineering framework LitmusChaos on Kubernetes.
  • Technologies: Apache Cassandra, Apache Kafka, Percona, MongoDB, MySQL, Golang, Docker, Kubernetes, AWS, GCP, GKE, EKS, Travis CI, Circle CI, GitHub Actions, Flux CD, Argo CD, LitmusChaos, Prometheus, Grafana, React.js, Plotly.js, TypeScript, Material UI, Styled Components, Gatsby.js, HubSpot, Netlify, Selenium, Jest, DevOps, Cloud computing, Chaos Engineering, Operators, Site Reliability Engineering
Cloud Computing · Distributed Systems · Back-End Web Development

EyeROV (IROV Technologies Private Limited)

Computer Vision and Deep Learning Intern

Jun 2020 - Aug 2020 · 2 mos · Kochi, Kerala, India · Remote

  • Image Similarity Ranking
  • Underwater ROV Image processing
  • Image stitching using Algorithmic and Deep Learning methods.
  • Technologies: Python, Computer vision, Image processing, Deep Learning, Django

ANZ

2 roles

Cyber Security - Virtual Intern

May 2020 - May 2020 · 0 mo · Kolkata, West Bengal, India · Remote

  • Social Engineering Investigation and Digital Investigation.

Data@ANZ - Virtual Intern

Apr 2020 - May 2020 · 1 mo · Kolkata, West Bengal, India · Remote

  • Exploratory Data Analysis and Predictive Analytics.

Cloud Native Computing Foundation (CNCF)

Open Source Software Maintainer (LitmusChaos)

May 2020 - Dec 2022 · 2 yrs 7 mos · Bengaluru, Karnataka, India · Remote

  • Open source contributor, reviewer and core team member
  • Software developer and chaos engineer for the CNCF chaos engineering sandbox project, LitmusChaos
  • Technologies: Kubernetes, Public clouds (AWS, GCP, Azure), Chaos Engineering, Site Reliability Engineering, Observability, Monitoring, Analytics
Cloud Computing · Distributed Systems · Back-End Web Development

LitmusChaos

Open Source Software Maintainer

May 2020 - Dec 2022 · 2 yrs 7 mos · Bengaluru, Karnataka, India · Hybrid

  • Open source contributor, reviewer and core team member
  • Software developer and chaos engineer for the CNCF chaos engineering sandbox project, LitmusChaos
  • Technologies: Kubernetes, Public clouds (AWS, GCP, Azure), Chaos Engineering, Site Reliability Engineering, Observability, Monitoring, Analytics
Cloud Computing · Distributed Systems · Back-End Web Development

MayaData (acquired by DataCore Software)

Software Engineer Intern

May 2020 - Jan 2021 · 8 mos · Bengaluru, Karnataka, India · Remote

  • Full-stack Software Development
  • DevOps engineer
  • Chaos Engineering for Kubera
  • Development of OSS chaos engineering framework LitmusChaos on Kubernetes
  • Integrated Litmus with platforms like Kublr (centralized monitoring) and Okteto (multitenant environment)
  • Technologies: Apache Cassandra, Apache Kafka, Percona, MongoDB, MySQL, Golang, Docker, Kubernetes, AWS, GCP, GKE, EKS, Travis CI, Circle CI, GitHub Actions, Flux CD, Argo CD, OpenEBS, LitmusChaos, Prometheus, Grafana, React.js, Plotly.js, TypeScript, Material UI, Styled Components, Gatsby.js, Selenium, Jest, DevOps, Cloud computing, Chaos Engineering, Operators, Site Reliability Engineering

HighRadius

Summer Intern (Software Development)

Apr 2020 - Jun 2020 · 2 mos · Bhubaneswar, Odisha, India · Remote

  • AI powered Fintech Application Development
  • RPA
  • Technologies: Java, JDBC, MySQL, Apache Tomcat, Python, Jupyter, Machine Learning, Data visualisation, Data processing, Data analytics, React.js, Dialogflow, NLP

Deloitte

Technology Consultant - Virtual Intern

Apr 2020 - May 2020 · 1 mo · Kolkata, West Bengal, India · Remote

  • Client Discovery
  • Design a Business Case
  • Considerations For Mobilization
  • Understanding Cloud Computing
  • Cloud Feasibility Assessment
  • Cloud Readiness Assessment
  • Define the project approach
  • Conduct a market scan
  • Further analysis & solution and presentation.

JPMorgan Chase & Co.

Software Engineer - Virtual Experience

Apr 2020 - May 2020 · 1 mo · Kolkata, West Bengal, India · Remote

  • Establishing Financial Data Feeds
  • Frontend Web Development
  • Data Visualization with Perspective

KPMG

Data Analytics Consultant - Virtual Intern

Apr 2020 - May 2020 · 1 mo · Kolkata, West Bengal, India · Remote

  • Data Quality Assessment
  • Data Analysis
  • Data Insights and Presentation.

DSC KIIT

3 roles

Cloud Applications Developer

Jan 2020 - Apr 2020 · 3 mos · Bhubaneswar, Odisha, India · On-site

  • Mentoring on Cloud Application Development and GCP platform
  • Technologies: GCP platform and services (BigQuery, Compute Engine, App Engine, GKE, PubSub, DataProc, BigTable, Load Balancing, Networking), Cloud computing, Kubernetes

Data Science, Machine Learning & AI Researcher

Mar 2019 - Apr 2020 · 1 yr 1 mo · Bhubaneswar, Odisha, India · On-site

  • Data Science Group Member in DSC KIIT.

Core Team Member

Mar 2019 - Apr 2020 · 1 yr 1 mo · Bhubaneswar, Odisha, India · On-site

  • Machine Learning Core Team Member and Mentor.

Skyline Racing

Software Development Engineer

Jul 2019 - Apr 2020 · 9 mos · Bhubaneswar, Odisha, India · On-site

  • Official Team Member of Skyline Racing (Software and Technical Department)

Neurapses Technologies

AI / NLP Intern

Jun 2019 - May 2020 · 11 mos · Greater London, England, United Kingdom · Hybrid

  • Responsible for development of autosys crawler using RPA
  • Data mining and acquisition
  • Preprocessing of click stream data
  • Semantic sentiment analysis and clustering using Natural Language Processing
  • Development of RESTful APIs for ML and neural network models
  • Customer churn and review analysis of TripAdvisor and other hotel booking sites.
  • Technologies: RPA, Web scraping, Web crawling, Data mining, Data processing, Machine Learning, NLP, Deep Learning, Python, Flask

Dafntech

Machine Learning Intern

May 2019 - Jun 2019 · 1 mo · Kolkata, West Bengal, India · On-site

  • Responsible for developing a machine learning classification system that identifies potential donors for social-good causes from their demographic, financial, and other relevant information, along with a Python API for it.
  • Technologies: Machine Learning, Python, Pandas, Numpy, Scipy, Sklearn, Matplotlib, Plotly, Jupyter, Flask

Project SWAG (Swayamchalit Gaadi)

Software Developer

Apr 2019 - Apr 2020 · 1 yr · Bhubaneswar, Odisha, India · On-site

  • Software Development of Intelligent Ground Vehicle (Self-Driving Car), SWAG Mark 1.
  • Technologies: Python, Deep Learning, Machine Learning, Image processing, Sensor fusion, Lidar, ROS, SLAM, Raspberry Pi, Nvidia Jetson Nano, Parallel processing, CUDA, TensorFlow, Keras, PyTorch, RCNN, Django, Node.js, Redis, Apache Kafka, Apache Cassandra, PostgreSQL, Java, Spring Boot, Groovy, JUnit

Samvriddhi Infotech

Data science for GIS Intern

Apr 2019 - Sep 2019 · 5 mos · Gurgaon, Haryana, India · Hybrid

  • Responsible for GIS application development for IGL portal on ArcGIS for ONGC
  • Report structuring with Geocortex Reporting
  • Geocortex Essentials designer
  • Data acquisition and analysis from live streaming geospatial data
  • Network and server analysis.
  • Technologies: Data analysis, GIS, ArcGIS, Geocortex, Java, Apache Spark, Apache Tomcat

Viden.io

Business Development Intern

Feb 2019 - Apr 2019 · 2 mos · Bhubaneswar, Odisha, India · On-site

  • Responsible for marketing, promotion, S.W.O.T. analysis, and business development analytics of viden.io for strategic growth and expansion of the platform.

Think India

Head Of Promotions

Aug 2018 - Dec 2018 · 4 mos · Bhubaneswar, Odisha, India · On-site

  • Responsible for leading the promotion of Think India and its event (Srujana).

Koderunners

4 roles

Moderator

Apr 2018 - Apr 2020 · 2 yrs · Bhubaneswar, Odisha, India · On-site

  • Core Team Member and Mentor.
  • Technologies: OpenGL, Unity3d, Unreal engine, Processing 2d, Blender, 3D builder, Object physics, C#, Java

Research And Development Engineer

Mar 2018 - Apr 2020 · 2 yrs 1 mo · Bhubaneswar, Odisha, India · On-site

  • Worked on several Research and Development Projects.

Tech Lead

Jan 2018 - Apr 2020 · 2 yrs 3 mos · Bhubaneswar, Odisha, India · On-site

  • Tech lead in game development.

Public Relations Officer

Dec 2017 - Apr 2020 · 2 yrs 4 mos · Bhubaneswar, Odisha, India · On-site

  • Public Relations
  • Management
  • Promotions

Yoken

Promotions Officer

Jan 2018 - Mar 2018 · 2 mos · Bhubaneswar, Odisha, India · On-site

  • Volunteered for promotion and registrations for an online workshop on Android development.

HelpAge India

Fund Raising Officer

Jan 2015 - Mar 2015 · 2 mos · Kolkata, West Bengal, India · On-site

  • Raising funds for helping the aged and the poor.

Education

Rochester Institute of Technology

Master of Science - MS — Computer Science

Aug 2024 - May 2026

Vellore Institute of Technology

Graduate coursework for Master's degree — Artificial Intelligence

Aug 2023 - Jul 2024

KIIT - Kalinga Institute of Industrial Technology

Bachelor of Technology — Computer Science

Jul 2017 - Aug 2021

Rabindra Bharati University, Kolkata

Bachelor of Arts - BA — Music Theory and Composition

Jan 2013 - Jan 2018

Don Bosco School Liluah

ISC — Computer Science

Jan 2015 - Jan 2017

Vidyaniketan

ICSE — Computer Application

Jan 2006 - Jan 2015

Aditya Birla Public School

Primary School

Jan 2002 - Jan 2006
