Krishna Kumar Singh

DevOps Engineer

Greater Delhi, Delhi, India · 16 yrs 10 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • 16+ years in Cloud and DevOps architecture
  • Expertise in AI and LLM infrastructure
  • Proven track record in cost optimization and automation
Stackforce AI infers this person is a Cloud and DevOps Architect specializing in FinTech and AI Infrastructure.

Skills

Core Skills

LLM Infrastructure, DevOps, SRE, Observability, Infrastructure Automation, Infrastructure Management

Other Skills

AWX Project, Alertmanager, Amazon Web Services (AWS), Ansible, ArgoCD, Docker Products, EKS, ELK, Elastic Stack (ELK), Grafana, Graylog, H100/H200/A10 GPUs, Helm, Helm (Software), Jenkins

About

I am a Cloud, DevOps, and AI Infrastructure Architect with 16+ years of experience building large-scale, reliable, and automated platforms across FinTech, eCommerce, and SaaS. My expertise spans AWS, Kubernetes (EKS), Terraform, Ansible, CI/CD, Observability, SRE, and most recently LLM/AI Infrastructure and GPU systems.

At Paytm, I lead the design and operations of enterprise-grade LLM inference platforms, deploying and optimizing large models such as LLaMA, DeepSeek, and 70B+ parameter LLMs. I specialize in vLLM, Triton, TensorRT-LLM, sglang, and OpenAI-compatible APIs built using LiteLLM. My work includes deep GPU-level optimization (A10/H100/H200), quantization, KV-cache tuning, tensor parallelism, batching strategies, and engineering ultra-low latency (<10ms) multi-model serving pipelines.

I architect and manage multi-cluster EKS environments, automated AMI lifecycles (Packer + Terraform), GitOps deployments via ArgoCD/Helm, blue/green rollouts, and zero-downtime upgrades. I also build unified Observability stacks using Prometheus, Grafana, Alertmanager, ELK, and MCP-based AI observability for automated anomaly detection and intelligent RCA.

Before Paytm, I scaled infrastructure and automation for Info Edge (Naukri, 99acres, Jeevansathi), implemented CI/CD and hybrid cloud environments for Dineout, and drove AWS modernization for Elara (Housing, PropTiger, Makaan). Earlier roles across Petronia and Infinite Technologies strengthened my foundations in Linux, networking, and high-availability operations.

I enjoy building self-healing, cost-efficient, and AI-ready cloud platforms, leading engineering teams, and enabling businesses to adopt next-generation AI and DevOps capabilities with confidence.

Experience

Paytm

Paytm — Senior DevOps Manager (AI & Cloud Platform)

Nov 2021 – Present · 4 yrs 4 mos · Noida · On-site

⚡ AI / LLM / GPU Infrastructure
  • Improved inference success rate by 35%, reduced GPU idle time by 40%.
  • Designed Paytm’s end-to-end AI & LLM infrastructure, enabling production deployment of large FinTech-focused models.
  • Built high-throughput inference platforms using vLLM, Triton, TensorRT-LLM, sglang, achieving sub-10ms latency.
  • Deployed & optimized LLaMA, DeepSeek, 70B+ LLMs with quantization, KV-cache tuning, tensor parallelism & graph optimizations.
  • Engineered GPU clusters on A10/H100/H200, improving throughput, memory efficiency & token/sec by 40–70%.
  • Implemented OpenAI-compatible API gateways using LiteLLM for unified model access.
  • Built MCP-based AI Observability for automated anomaly detection & intelligent RCA.
  • Achieved ~28% GPU & cloud savings via batching, scheduling & autoscaling optimizations.
  • Automated AI platform provisioning using Terraform, Packer, Ansible, including GPU node lifecycle.
  • Tech Stack: vLLM, Triton, TensorRT-LLM, sglang, H100/H200/A10 GPUs, EKS, Terraform, Helm, ArgoCD, Prometheus, Grafana, ELK, Packer, LiteLLM, MCP

☸ Kubernetes & Cloud Automation
  • Architected and managed multi-cluster EKS with GitOps, Helm, ArgoCD & self-healing deployments.
  • Automated patching, AMI lifecycle & OS hardening with Packer + Terraform.
  • Built IaC for EKS, EC2, VPC, RDS, ALB, IAM & S3.

☁ AWS Architecture & FinOps
  • Delivered 28% cloud cost reduction through GPU right-sizing, autoscaling & compute optimization.
  • Designed secure VPC, routing & multi-AZ failover.

📊 SRE & Observability
  • Built unified monitoring using Prometheus, Grafana, ELK, Alertmanager.
  • Implemented GPU-level metrics for SLO compliance.

👥 Leadership
  • Led & mentored 10+ Cloud/DevOps engineers.
Skills: vLLM, Triton, TensorRT-LLM, sglang, H100/H200/A10 GPUs, EKS, +11 more

Info Edge India Limited (Naukri.com Group)

Info Edge India Ltd. — Technical Operations Architect

Mar 2018 – Nov 2021 · 3 yrs 8 mos · Noida · On-site

⚙ Hybrid Cloud & Infrastructure
  • Led production operations across On-prem, OpenStack, Xen & AWS for Naukri, Jeevansathi, 99acres & Shiksha.

☸ IaC & Automation
  • Automated provisioning of 300+ servers/day using Terraform + Ansible.
  • Auto-generated CI/CD pipelines tied to infrastructure creation.

🔗 Networking & Load Balancing
  • Implemented HA microservices routing via HAProxy, LVS, Nginx & Consul.

📊 Observability
  • Built monitoring using ELK, Grafana, Zabbix.
Skills: Nginx, OpenStack, AWX Project, Docker Products, Helm (Software), Prometheus.io, +12 more

Times Internet

Dineout (Times Internet) — DevOps Engineer

Dec 2015 – Mar 2018 · 2 yrs 3 mos · Noida · On-site

  • Managed hybrid infrastructure across AWS and on-prem, supporting mission-critical restaurant discovery and booking platforms.
  • Modernized application deployments through containerization and microservices using Docker.
  • Designed and maintained CI/CD pipelines in Jenkins, automating 200+ jobs spanning builds, deployments, continuous testing, backups, and infra diagnostics.
  • Automated provisioning and configuration using Ansible to ensure reproducible environments.
  • Implemented HA and load-balanced architectures using ELB, HAProxy, and Nginx to ensure platform stability during peak events.
  • Tuned performance and improved resilience across services.
  • Built centralized and scalable logging using Graylog2, enabling quick RCA and multi-environment troubleshooting.
  • Integrated New Relic for comprehensive APM visibility and anomaly detection.
  • Implemented SonarQube for automated code-quality checks across engineering teams.
  • Worked closely with Development and QA teams to streamline release cycles, reduce deployment times, and improve delivery reliability.
  • Supported production operations, scaling initiatives, and day-to-day infra stability.
Skills: AWX Project, Docker Products, Ansible, Team Management, DevOps, Elastic Stack (ELK), +6 more

Elara Group (Housing, PropTiger & Makaan)

Elara Group (Housing, PropTiger & Makaan) – Member of Technical Staff, DevOps

Dec 2013 – Oct 2015 · 1 yr 10 mos · Noida · On-site

  • Automated end-to-end build and release pipelines using Jenkins, improving deployment speed and reliability.
  • Managed AWS environments (EC2, ELB, ASG, RDS, VPC, S3) for high-traffic real-estate platforms.
  • Achieved 30% AWS cost reduction through automation and right-sizing.
  • Maintained Tomcat, Apache, Nginx, Solr, and backend stacks with 24×7 availability and zero-downtime rollouts.
  • Built monitoring using Graylog2, S3, Nagios, and New Relic; handled RCA, incidents, and performance tuning.
  • Supported both on-prem and AWS hybrid infra, ensuring scalable, highly available environments.
  • Coordinated with Dev, QA, and Infra teams for architecture improvements, stable releases, and performance optimization.
Skills: Docker Products, Ansible, Team Management, DevOps, Elastic Stack (ELK), Amazon Web Services (AWS), +2 more

Petronia Technologies

Petronia Technologies — Technical Team Lead

Oct 2009 – Dec 2013 · 4 yrs 2 mos · Noida · On-site

Clients & Duration:
  • Pine Labs — System Admin (2009–2010)
  • Vriti / Bagittoday — System Admin (2010–2011)
  • Clickable / Syncapse — DevOps & AWS (2011–2013)
  • Guavus — Data Center Support (2013)
  • Ixigo / Indiahomes — DevOps (AWS) (2013)

Key Contributions:
  • Provided 24×7 production support across multiple enterprise clients.
  • Managed Linux servers, networking, and hybrid cloud environments.
  • Automated deployments and system maintenance with scripting.
  • Built monitoring, backup systems, and handled RCA.
  • Coordinated client delivery, vendors, documentation, operations.
Skills: Nginx, Docker Products, Team Management, Zabbix, DevOps, Infrastructure Management

Infinite Technologies NCR

Infinite Technologies — Senior Linux System Administrator

Mar 2009 – Sep 2009 · 6 mos · Greater Delhi Area · On-site

  • Provided 24×7 technical support for Linux servers and core IT systems.
  • Managed user, group, and access administration, including quotas and permissions.
  • Performed data backup and recovery using tar, dump, and filesystem tools.
  • Installed and maintained local networks and network devices for smooth operations.
  • Handled Linux service configuration and troubleshooting to ensure system stability.
Skills: Nginx

Education

Dr. Ram Manohar Lohia Avadh University (RMLAU), Faizabad (Ayodhya)

Bachelor of Arts — English Language and Literature/Letters

Jan 2006 – Jan 2009
