Suraj Nayak

DevOps Engineer

Bengaluru, Karnataka, India · 9 yrs 11 mos experience

Key Highlights

Over a decade of experience in Cloud Infrastructure and DevOps.
Expert in building scalable and reliable platforms across multiple cloud providers.
Proven track record in modernizing legacy systems and driving cloud cost efficiency.

Stackforce AI infers this person is a Cloud Infrastructure and DevOps expert specializing in scalable solutions across multiple cloud platforms.

Skills

Core Skills

Cloud Infrastructure · SRE Practices · Observability Solutions · Infrastructure Management · DevOps Excellence · Infrastructure Automation · Monitoring Platform Implementation · Kubernetes Architecture · Microservices Management · Cloud Security

Other Skills

Amazon Web Services (AWS) · Ansible · Bitbucket · Cloud Automation · Cloud Computing · Cloud Services · Communication · Configuration Management · DevOps · Disaster Recovery · Domain Architecture · Git · GitHub · Google Cloud Platform (GCP) · IT Operations

About

With over a decade of experience in Cloud Infrastructure and DevOps, I specialise in building scalable, reliable, and observable platforms across AWS, Azure, and Kubernetes. I’ve led initiatives to modernise legacy systems, implement SRE best practices (SLIs/SLOs, error budgets, chaos engineering), and drive cloud cost efficiency through FinOps and automation. At AiDash and Cargill, I led infrastructure scalability and automation programs using Terraform, Kubernetes, and Ansible, optimising performance and reducing costs. Passionate about operational excellence and cross-functional collaboration, I focus on designing secure, resilient, and data-driven systems that empower teams and ensure business continuity in dynamic environments.

Experience

Total Experience: 9 yrs 11 mos

Average Tenure: 2 yrs 3 mos

Current Experience: 7 mos

Locus

Principal Engineer

Oct 2025 – Present · 7 mos · Bengaluru, Karnataka, India · Hybrid

Architecting Observability Frameworks: Drive implementation of advanced monitoring and tracing using Prometheus, Grafana, and OpenTelemetry to ensure deep system visibility and proactive issue detection.
Defining SRE Practices: Establish and enforce SLIs, SLOs, and error budgets to balance innovation with reliability, embedding SRE culture across engineering teams.
Cloud Infrastructure Leadership: Lead design and optimization of multi-cloud (AWS/Azure) infrastructure for scalability, reliability, and cost efficiency, aligned with FinOps principles.
Platform Automation & Modernization: Oversee automation through Infrastructure-as-Code (Terraform, Ansible) and modernization of legacy systems into containerized, cloud-native architectures.
Cross-Functional Technical Strategy: Collaborate with product, data, and security teams to build cohesive, end-to-end solutions that align engineering efforts with business goals.
Mentorship & Technical Governance: Mentor engineering teams on DevOps, SRE, and cloud best practices while defining architectural standards and driving engineering excellence.

Production Deployment · Domain Architecture · DevOps · Cloud Infrastructure · SRE Practices

Roku

Sr. Software Engineer, Infrastructure

Feb 2024 – Oct 2025 · 1 yr 8 mos · Bengaluru, Karnataka, India · Hybrid

Roku, the leading TV streaming platform in the U.S., Canada, and Mexico based on streaming hours, is committed to transforming global television viewing. My role involves evaluating and implementing advanced monitoring and observability solutions optimized for our unique operational environment. This includes hands-on experience with technologies such as Prometheus, Grafana, the ELK stack, Datadog, Jaeger, and OpenTelemetry, to ensure optimal selection and utilization.
Effective collaboration is paramount. I work closely with development and operations teams to instrument our infrastructure, guaranteeing comprehensive monitoring of all applications and services. This is achieved by adhering to best practices, minimizing overhead while maximizing visibility and actionable insights.
My daily responsibilities encompass configuring alert thresholds, leading incident response, and leveraging observability data for both reactive problem-solving and proactive system improvements. This includes identifying performance bottlenecks and collaborating on scalability enhancements, viewing each challenge as an opportunity for optimization.
Automation is a critical component of my position. I develop scripts and integrations to streamline processes, ensuring seamless integration of our observability solutions with CI/CD pipelines and other core systems.
My responsibilities include: managing AWS production application infrastructure (including networking, EKS & ECS clusters); Kubernetes cluster management; efficient IoT device-to-server communication design and implementation; leveraging specialized skills in Python, Infrastructure as Code, and AWS services to build high-quality production applications; developing cloud computing strategies; creating cloud adoption plans (AWS, GCP, Azure); designing cloud applications; and managing and monitoring cloud environments.

Terraform · Kubernetes · Communication · Observability · Python (Programming Language) · Amazon Web Services (AWS) · +2

AiDash

Staff Engineer

Mar 2022 – Feb 2024 · 1 yr 11 mos · Bengaluru, Karnataka, India

DevOps Expertise for High-Growth Startups
DevOps Leadership and Infrastructure
Driving DevOps Excellence: Orchestrating the deployment, automation, and optimization of cloud-based infrastructure and software delivery pipelines for high-performing startups.
Agile DevOps Leadership: Enabling cross-functional teams to achieve continuous integration and delivery through collaborative workflows, tools, and best practices.
Infrastructure as Code Advocate: Leveraging cutting-edge technologies and infrastructure automation tools (e.g., Terraform, Kubernetes, Ansible) to streamline operations, reduce costs, and enhance reliability.
Scalability, Performance, and Cloud Expertise
Scalability and Performance Optimization: Designing and implementing scalable architectures, performance monitoring strategies, and cloud cost optimization techniques to support rapid growth and optimal resource utilization.
CI/CD Champion: Implementing robust continuous integration and delivery pipelines, enabling startups to rapidly deliver features and enhancements while maintaining quality and stability.
Cloud Architecture Expert: Designing and managing cloud-based architectures (AWS, Azure, Google Cloud) to maximize uptime, security, and cost efficiency while aligning with business objectives.

Terraform · Kubernetes · Communication · GitHub · Programming · Cloud Computing · +6

Cargill

Cloud Engineer

Jan 2020 – Mar 2022 · 2 yrs 2 mos · Bangalore

Implemented a comprehensive monitoring platform using VictoriaMetrics and established logging and tracing using Coralogix for production systems on EKS, managing 13 trillion metric data points.
Worked on Kubernetes architecture, involving a multi-cluster setup using Cilium Cluster Mesh, end-to-end automation of cluster creation using Terraform and Helm, and CI/CD pipelines using Argo CD and Jenkins.
Planned and executed major upgrades of EKS clusters without downtime using Terraform and Ansible, and implemented disaster recovery (DR) for critical P0 services.
Led the migration of 300+ applications from EC2 to Kubernetes within 80 days, with zero application downtime.
Participated in the planning, design, and migration of 600+ services to Google Cloud Platform (GCP), covering GCP projects, environment setups, GKE cluster and node pool designs, CI/CD implementation (Jenkins and Argo CD), monitoring, logging, and supporting automation.
Worked on system optimizations at scale to enhance reliability and uptime in Kubernetes, and introduced an organization-wide change management process.
Assisted in the creation of an SRE dashboard that captures incidents by severity along with Root Cause Analyses (RCAs), uptime, alert configurations, and analytics.
Organized and participated in Root Cause Analysis (RCA) and quarterly reflection meetings to analyze uptime and outages and improve application availability.
Implemented a Slack bot for manual canary promotions (Flagger) using Flask and Slack APIs, and contributed to multiple automations that reduce manual intervention.
Ensured system stability, documenting and supporting standards and procedures as per the organization's guidelines.

Terraform · Kubernetes · Communication · Ansible · Linux · Infrastructure as code (IaC) · +15

Microland Limited

Cloud Architect

Jun 2016 – Jan 2020 · 3 yrs 7 mos · Bengaluru · On-site

Migrating and managing all microservices-based applications on the AWS EKS cluster
Implementation of various tools such as Kafka, a logging stack (EFK), and a monitoring stack (Thanos) in the EKS cluster using Helm charts
Security using Cloudflare
Disaster management plan with Terraform and Bash scripts
CI/CD using Jenkins and Argo CD
Load balancing of services using the NGINX ingress controller and the ALB ingress controller
Deploying and managing Java and Node.js applications on Kubernetes (EKS)
Configuration management using Helm charts for Kubernetes clusters for major services
Monitoring and logging using AWS CloudWatch, Thanos, Grafana, and the EFK stack (Elasticsearch, Fluentd, Kibana)
Implementation and integration of application tracing using Jaeger
Automation using Bash scripting and Python
SRE and on-call for all projects, using ELK, New Relic, CloudWatch, and PagerDuty
Contributing to IaC automation by writing Terraform modules, creating images with Packer, and writing SaltStack formulas for Aerospike, access, etc.