Chandra Prakash Joshi

CTO

Bengaluru, Karnataka, India · 13 yrs 8 mos experience
Most Likely To Switch: Highly Stable

Key Highlights

  • Led multi-million-dollar cost savings through infrastructure optimization.
  • Expert in Kubernetes and cloud-native platform engineering.
  • Strong background in Site Reliability Engineering practices.
Stackforce AI infers this person is a Cloud Infrastructure and DevOps expert with a focus on Fintech and E-commerce.

Skills

Core Skills

Kubernetes Platform Engineering, Site Reliability Engineering (SRE), Cloud Infrastructure Management, Observability Engineering, Payment System Reliability, Monitoring and Observability, DevOps Engineering, Continuous Integration and Continuous Delivery (CI/CD), Linux System Administration

Other Skills

Kubernetes, AWS, Terraform, Prometheus, Grafana, SRE, Loki, Elasticsearch, CouchbaseDB, Redis, Nagios, Python, Shell Scripting, Jenkins, Bitbucket

About

Principal Platform Engineer / Staff SRE with 13+ years of experience designing, migrating, and operating highly reliable, cloud-native platforms across large-scale, latency-sensitive environments. I specialize in Kubernetes platform engineering, Site Reliability Engineering (SRE), and cloud infrastructure on AWS, with a strong focus on operability, cost efficiency, and platform sustainability.
At Samsung, I lead platform modernization initiatives, including large-scale migration from Rancher-managed Kubernetes to Amazon EKS. I own platform reference architecture, rollout strategy, and operational standards across environments, enabling engineering teams through self-service platforms, golden paths, and standardized CI/CD pipelines using Terraform, ArgoCD, GitHub Actions, and Helm.
I have extensive hands-on experience designing and operating highly available Kubernetes platforms across multiple AWS regions and hybrid environments. My work spans SRE practices such as defining SLIs/SLOs, error budgets, alert hygiene, incident management, and capacity planning, ensuring reliability without compromising latency.
I architected centralized observability platforms using Prometheus, Grafana, Loki, and Tempo, implementing golden signal dashboards and distributed tracing with controlled telemetry overhead. By migrating from OpenSearch to Loki and tuning data ingestion pipelines, I delivered significant cost reductions while improving operational visibility.
My background includes strong cloud cost governance and FinOps practices, resulting in multi-million-dollar annual savings through infrastructure right-sizing, storage optimization, automated environment shutdowns, and workload-aware scheduling strategies. I also enforce security best practices using RBAC, IAM/IRSA, Vault, and network policies to meet compliance and risk requirements.
Earlier in my career, I worked with high-traffic consumer platforms such as Quikr, Goibibo, and Lenskart, where I managed large Linux-based infrastructures, built CI/CD pipelines, automated operations with scripting, and supported production systems handling significant scale and availability requirements. Beyond delivery, I actively mentor engineers, conduct knowledge-sharing sessions, participate in architecture discussions, and explore emerging tools and ideas through hackathons and proof-of-concepts. I am passionate about building reliable platforms that scale with business growth and reduce operational complexity.

Experience

Total Experience: 13 yrs 8 mos
Average Tenure: --
Current Experience: 9 yrs 8 mos

Samsung Electronics

3 roles

Senior Chief Engineer

Promoted

Mar 2020 - Present · 6 yrs 3 mos

  • As a Senior Chief Engineer/Staff SRE, I lead platform engineering and SRE initiatives for Samsung Ads, a highly latency-sensitive, real-time advertising platform operating at ~3 million RPS. I am responsible for designing, evolving, and operating Kubernetes and cloud platforms that support business-critical workloads where availability, scalability, and latency are core success metrics.
  • I own the platform architecture and migration strategy for moving a hybrid infrastructure (AWS + self-managed data centers) running on Rancher-managed Kubernetes to a fully AWS-native Amazon EKS platform. This includes defining target architecture, phased rollout plans, automation standards, and risk-mitigation strategies to ensure stable migrations without sustained SLO impact.
  • I design and operate highly available Kubernetes platforms across multiple AWS regions and availability zones, implementing workload isolation, topology-aware scheduling, and capacity planning to meet strict reliability and performance requirements. Infrastructure provisioning and lifecycle management are automated using Terraform, ensuring consistency and scalability.
  • I drive SRE practices by defining SLIs, SLOs, and error budgets, improving alert quality, reducing operational toil, and strengthening incident response through standardized runbooks and post-incident reviews.
  • I standardized CI/CD and deployment frameworks using Helm, GitHub Actions, and Jenkins, enabling safe, high-frequency releases via blue/green and canary deployments. I also architected centralized observability using Prometheus, Grafana, Loki, and Tempo, implementing golden-signal dashboards with controlled telemetry overhead.
  • Through platform and observability optimization, infrastructure right-sizing, and automation, I delivered over $1.5M in annual cost savings. I enforce security best practices using RBAC, IAM, Vault, and network policies, and actively mentor engineers while contributing to cross-team architectural discussions.
Kubernetes, AWS, Terraform, Prometheus, Grafana, SRE (+2 more)

Chief Engineer

Promoted

Mar 2018 - Feb 2020 · 1 yr 11 mos

  • As a Chief Engineer, I worked on Samsung Pay, a highly critical payment platform handling sensitive financial transactions and audit-grade data. I was responsible for designing, operating, and securing self-managed cloud infrastructure on AWS to support payment processing workloads with strict availability, security, and compliance requirements.
  • I managed and scaled self-hosted EC2-based infrastructures, ensuring high availability, fault tolerance, and secure access for payment services. Infrastructure provisioning and lifecycle management were automated using Terraform, enabling consistent and auditable environments aligned with compliance standards.
  • I owned the design and operation of centralized logging and audit pipelines using Elasticsearch, ensuring secure storage, retention, and analysis of critical transaction and compliance logs. I supported forensic analysis and audit requirements by maintaining reliable and searchable log pipelines.
  • I worked extensively with CouchbaseDB and Redis to support low-latency, key-value data access patterns required by payment systems, ensuring performance, availability, and data integrity. I also played a key role in debugging and resolving payment-related production issues by collaborating closely with application and security teams.
AWS, Terraform, Elasticsearch, CouchbaseDB, Redis, Cloud Infrastructure Management (+1 more)

Lead Software Engineer

Aug 2016 - Feb 2018 · 1 yr 6 mos

  • I joined Samsung as a Lead Software Engineer (SRE) and was promoted to Senior Lead Engineer within the first year, reflecting strong delivery and ownership across critical platforms. I initially worked on Samsung IoT and Bixby Voice, supporting large-scale, consumer-facing systems with stringent availability, performance, and reliability requirements.
  • I was responsible for designing, operating, and supporting Linux-based infrastructure on AWS and self-managed environments, ensuring high availability, fault tolerance, and operational stability for always-on services. My work included capacity planning, performance tuning, and production support for high-traffic workloads.
  • As part of the SRE function, I implemented automation and infrastructure-as-code practices using Terraform and scripting, reducing manual operations and improving deployment consistency. I contributed to CI/CD pipelines and supported safe production releases for frequently changing services.
  • I built and maintained monitoring and alerting systems using Prometheus, Grafana, Nagios, and commercial APM tools, enabling proactive detection of failures and performance degradation. I worked closely with development teams to debug production issues, perform root-cause analysis, and improve system resilience through post-incident reviews.
Terraform, Prometheus, Grafana, Nagios, Site Reliability Engineering (SRE)

Quikr

Senior DevOps Engineer

Jan 2015 - Aug 2016 · 1 yr 7 mos · Bengaluru Area, India · On-site

  • At Quikr, a large-scale online classifieds platform, I was part of the Infrastructure Management team responsible for designing, operating, and supporting high-availability production environments serving high-traffic consumer workloads. The infrastructure primarily leveraged Linux/UNIX, AWS, and open-source technologies, aligned with business availability and performance requirements.
  • I worked on automating operational tasks and infrastructure workflows using Python and shell scripting, reducing manual effort and improving system reliability. I was involved in deploying and managing production environments using technologies such as Apache, Nginx, HAProxy, and Elasticsearch, along with NoSQL data platforms including MongoDB and Couchbase to support scalable application workloads.
  • I configured and managed CI/CD pipelines using Jenkins and Bitbucket/Git, enabling automated builds and deployments. My responsibilities included OS-level security hardening, performance tuning, system monitoring, and infrastructure upgrades across both physical and virtualized environments.
  • I supported a wide range of Linux distributions (RHEL, CentOS, Ubuntu) and managed core services such as MySQL, web servers, caching layers, and messaging components. I was also responsible for provisioning and maintaining physical servers, virtual machines on VMware ESX, and cloud instances on AWS, including network and access configuration.
  • As part of the on-call rotation, I handled production incidents, technical escalations, and cross-team coordination to ensure rapid issue resolution. I regularly analyzed system bottlenecks and recommended infrastructure and architectural improvements to enhance performance, stability, and operational efficiency.
Python, Shell Scripting, Jenkins, Bitbucket, DevOps Engineering

ibibo Group

Linux Engineer

Mar 2014 - Jan 2015 · 10 mos · Gurgaon, India · On-site

  • At ibibo Group, I was responsible for operating and supporting production Linux and cloud infrastructure for high-traffic consumer applications. My role focused on ensuring system availability, performance, and security across on-premise and AWS environments supporting business-critical workloads.
  • I managed day-to-day Linux administration, including user and access management, file systems, performance monitoring, system security, authentication, and scheduled job management. I supported CDN operations, cache invalidation (Varnish), and FTP services, ensuring smooth content delivery and operational stability.
  • I worked on CI/CD tooling using Hudson and Git, automating operational and deployment tasks through shell scripting and supporting frequent production releases. I was involved in provisioning and hardening operating systems, including RAID configuration, OS installation, and security baselining for production systems.
  • I contributed to the design, engineering, and administration of Linux and cloud infrastructure on AWS, managing services such as EC2, VPC, Route53, CloudFront, S3, Auto Scaling, and IAM. I supported physical-to-cloud migration efforts, helping teams transition workloads to scalable and resilient cloud architectures.
  • My responsibilities also included configuring and operating core application infrastructure components such as LAMP/LEMP stacks, Apache, Nginx, Nagios, Splunk, Varnish, Memcached, and HAProxy, providing monitoring, performance visibility, and operational support for production systems.
Linux, AWS, Apache, Nginx, Linux System Administration

Lenskart.com

2 roles

Linux Engineer

Jul 2012 - Feb 2014 · 1 yr 7 mos · New Delhi Area, India

  • I started my career at Lenskart as a Linux Engineer, where I built a strong foundation in production infrastructure, operations, and reliability for high-traffic e-commerce systems. I worked closely with Release Operations, Development, and QA teams to support frequent production releases and ensure system stability and performance.
  • I was responsible for configuring, operating, and monitoring production Linux servers, using tools such as Nagios, New Relic, Splunk, and Google Analytics to track system health, application performance, and user behavior. My day-to-day work included log rotation and analysis, network troubleshooting, and performance monitoring across CPU, memory, disk, and I/O.
  • I supported application infrastructure based on LAMP/LEMP stacks, managing Apache and Nginx servers behind hardware and software load balancers. I also worked extensively with caching and content delivery systems, including Varnish, Akamai, and AWS CloudFront, handling cache purges and traffic optimization.
  • My responsibilities included backup and recovery operations using MySQL backup and replication, rsync, and AWS S3, as well as package management using RPM and YUM. I configured and maintained CI/CD tooling using Hudson and Git, supporting build and release workflows.
  • I managed DNS records across multiple platforms, including AWS Route53 and internal DNS systems, and coordinated with external service providers following defined escalation processes to resolve production issues.
  • During this period, I invested heavily in building deep Linux expertise and earned internationally recognized certifications including RHCSA, RHCE, and RHCA, which laid the foundation for my later growth into cloud, platform engineering, and SRE roles.
Linux, AWS, MySQL, Linux System Administration

Trainee

Jan 2012 - Jul 2012 · 6 mos · New Delhi Area, India

Education

Dehradun Institute of Technology

Master of Computer Application (MCA) — Computer Science

Jan 2009 - Jan 2012

S.S.J Campus Almora

Bachelor of Computer Application (BCA) — Computer Science

Jan 2006 - Jan 2009