Pawan Alwandi

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India · 17 yrs 6 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Led a globally distributed team of 35+ engineers and SREs.
  • Achieved significant performance gains and cost savings.
  • Developed innovative solutions for complex challenges.
Stackforce AI infers this person is a Site Reliability Engineer with extensive experience in SaaS infrastructure management.

Skills

Core Skills

Site Reliability Engineering, Infrastructure Management, Cloud Infrastructure, Cost Management, Accountability Culture, Platform Stability, Security Engineering, Platform Resilience, Infrastructure Efficiency, Team Leadership, Efficiency Improvement, Infrastructure Development, Operational Efficiency

Other Skills

AWS, Algorithms, Apple HLS, Automation, Automation Tools, Azure, C, Capacity Planning, Ceph, Cost Optimization, DDoS Mitigation, DVB, Debugging, Embedded Systems, GCP

About

I’m a seasoned Site Reliability Engineering (SRE) leader with 16+ years of hands-on experience solving complex challenges across the stack, from Linux internals and networking to large-scale distributed systems and applications written in Python, Go, and C++. I’ve led globally distributed SRE teams of 35+ members across Asia Pacific, EMEA, and the Americas, driving reliability, scale, and operational maturity across time zones and cultures. I’ve operated fleets ranging from tens of thousands of VMs to hundreds of thousands of edge servers.

I don’t believe moving fast has to mean breaking things. I believe in moving fast safely, through automation, rigorous testing, progressive rollout, and resilient guardrails.

Technical depth:

  • I believe I can dive into any complex problem, across any system, any layer of the stack, and any codebase. I thrive in ambiguity and bring relentless focus to resolving hard problems. Solutions may take time, but they’re never out of reach. I deeply grok the systems I work on, often identifying problematic components, or even the root cause, just from surface-level symptoms.
  • I can move seamlessly from high-level insights (e.g. Metabase dashboards) to low-level debugging, digging into packet traces (tcpdump), system calls (strace), and performance metrics (vmstat, mpstat, sar, free, etc.) as needed. Whether it’s infrastructure, the networking stack, OS behavior, or application performance, I know how to trace a problem to its source.

Leadership and Management:

  • I practice adaptive leadership, adapting my style to meet the needs of the team and the moment: servant leadership to coach and empower, directive leadership during high-profile incidents to drive clarity and action, transformational leadership to inspire innovation and change, or delegative leadership to enable autonomy among seasoned SREs.
  • My leadership is rooted in empathy, accountability, and outcomes, driving impact while putting people first.
  • I believe teams do their best work when they are trusted, empowered, and challenged. I’ve consistently created such environments and earned strong influence and rapport within and across teams.

Experience

Platform.sh

4 roles

Site Reliability Engineering

Jan 2025 – Present · 1 yr 2 mos · Remote

  • Execution-focused leadership within the broader SRE organisation. Manage deployment scheduling, implement core SRE principles including SLOs/SLIs, and lead multiple SRE-driven initiatives to reduce COGS and improve infrastructure efficiency.
  • Accomplishments:
  • Migrated workloads on Azure and GCP to AMD-based instances, enabling SMT and mitigating processor side-channel vulnerabilities. This led to faster HTTP response times, lower CPU usage across regions, and the decommissioning of 20–40% of hosts, achieving significant performance gains and cost savings.
  • Structured and enriched monthly SRE updates to executives—improving visibility into infrastructure health, incident trends, reactive vs. proactive workload balance, and key team operational metrics.
  • Defined and implemented SLOs/SLIs across the edge, orchestration, and storage layers—establishing clear reliability targets and driving a culture of accountability.
  • Infrastructure scale:
  • → Adobe Commerce Cloud (Platform.sh managed): 25,000+ VMs across public cloud providers (AWS, Azure, GCP).
  • → Adobe Starter Commerce Cloud: Kubernetes-like PaaS platform orchestrating 1,800+ VMs and 200,000+ LXC containers.
SLOs/SLIs, Azure, GCP, VMs, HTTP, Site Reliability Engineering

Operations & Engineering

Promoted

Jan 2022 – Dec 2024 · 2 yrs 11 mos · Remote

  • Led a 35+ member team of engineers and SREs—including 3 direct-reporting managers—focused on improving platform stability and reducing COGS across Platform.sh’s PaaS container infrastructure.
  • Accomplishments:
  • Co-developed with our CTO a novel TLS-layer concurrency limiter, the first of its kind, which controls concurrent TLS handshakes using the SNI header. This significantly improved the platform's ability to mitigate abuse and DDoS attacks—reducing monthly incidents from 1–2 to virtually zero. One of my most impactful and rewarding contributions, with lasting effect on platform resilience.
  • Resolved a Ceph RBD volume leak caused by copy-on-write (CoW) snapshot relationships, preventing $9,000+/month in storage waste.
  • Conducted a thorough review and recommended removal of two disaster recovery snapshots without raising Ceph replica counts, leveraging existing data replication across AWS partition placement groups. This optimization delivered over $1M in cost savings, improving company margins by 2–3 basis points!
  • Established a regular release cadence for platform components, improved testing and validation processes, and introduced phased rollouts—accelerating delivery velocity while maintaining platform safety and reliability.
TLS, DDoS Mitigation, Ceph, Cost Optimization, Site Reliability Engineering, Platform Stability

Site Reliability Engineering Manager

Jul 2018 – Dec 2021 · 3 yrs 5 mos · Remote

  • Led a team of 12 high-performing SREs across Asia Pacific and EMEA, building and operating a modern, hybrid-cloud, container-based PaaS.
  • Accomplishments:
  • Grew the APAC SRE team from 2 to 6 engineers, and within a year, took on leadership of the EMEA region — managing a combined team of 12 SREs across multiple time zones.
  • Designed and built the first PaaS region upgrade automation, reducing upgrade time by 50% (from 8 hours to 4) by automating a complex 20-step manual workflow.
  • Led the rollout of enterprise and standard stacks on Google Cloud and OpenStack, modernizing platform support and tooling to enable the company’s next wave of growth across multiple cloud environments.
  • Played a key role in shaping platform direction, contributing to policy decisions around capacity planning and cost optimization to help manage and reduce COGS.
Kubernetes, Google Cloud, OpenStack, Capacity Planning, Site Reliability Engineering, Team Leadership

Founding Site Reliability Engineer

May 2016 – Jun 2018 · 2 yrs 1 mo · Remote

  • First SRE hire. Heavy hands-on. Fast-moving. Zero red tape. Built core infra that scaled with the company.
  • Accomplishments:
  • Built the first automation tools for backups and instance resizes — taking repetitive, manual work off engineers' plates from day one.
  • Developed Python-based enterprise provisioning tools for AWS (EC2, EBS, VPC, S3), cutting environment setup time from 5 hours to just 45 minutes. We went from provisioning 1–2 environments per day to several — massively accelerating provisioning velocity.
  • Brought up the enterprise stack on Azure using Virtual Machines, Azure Blob Storage, ARM templates, and unmanaged disks, unlocking million-dollar enterprise deals.
  • Handled gnarly L3 support escalations, staying close to production and solving the most difficult customer issues.
Automation Tools, AWS, Python, Site Reliability Engineering, Infrastructure Development

Akamai Technologies

3 roles

Lead Software Engineer

Promoted

Apr 2014 – May 2016 · 2 yrs 1 mo · Bengaluru Area, India

  • Led stability engineering efforts within the MCDN servers team from Bangalore.

Senior Software Engineer, Stability / Reliability Engineering

Promoted

Apr 2013 – Mar 2014 · 11 mos · Bengaluru Area, India

  • Troubleshot the highly distributed, fault-tolerant, large-scale streaming production network.
  • Performed code-level debugging of production anomalies to ensure availability and efficiency of the Akamai HD and Flash streaming networks.
  • Troubleshot production escalations from Tier 2 and provided timely responses by diving deep into source code, analyzing network traces, and reviewing product configurations and settings.
  • Carried out proactive analysis and continuous improvements to reduce latency and increase scalability of the Akamai HD VOD network.
  • Part of a team handling periodic on-call duty.

System Software Engineer

May 2012 – Apr 2013 · 11 mos · Bengaluru Area, India

  • Worked on Streaming/Edge-stability initiatives in the Media and CDN Engineering team.

Sling Media

Software Engineer

Jul 2011 – May 2012 · 10 mos · Bengaluru Area, India

  • Design and development of new features, and bug fixing, on the Apple HLS module.
  • Improved the existing SSM algorithm (server-side adaptive bit rate streaming mechanism) on HLS to estimate network throughput more accurately, normalizing bit rate fluctuations and leading to a better user experience.
  • Worked on TI's NDK (Network Developer’s Kit) to fix issues related to initial retransmission timeout (RTO). This improved the HLS streaming experience in low bandwidth networks (esp. in Relay mode).

NDS Limited

2 roles

Senior Software Engineer

Apr 2011 – Jun 2011 · 2 mos · Bengaluru Area, India

  • Maintenance of SI Management component.
  • Code rework on the File Delivery Manager component, responsible for different carousel systems (DSM-CC and MHC).

Software Engineer

Jul 2008 – Apr 2011 · 2 yrs 9 mos · Bengaluru Area, India

  • Development and debugging of Middleware components responsible for BackEnd Management, A/V Presentation, SI Management etc.
  • Extensive work on a component responsible for filtering, storage, and maintenance of Program Specific Information/Service Information (PSI/SI) from the broadcast stream in an STB.
  • Supported integration and testing teams for various projects.

Avaya (Nortel Networks)

Intern

Feb 2008 – Jun 2008 · 4 mos

  • “Enhancement of Ntop for IPFIX compliance”
  • Implemented parsing of IPFIX export packets.
  • Implemented identification of template flowsets, data flowsets, data records, etc. from the packets, and extraction of flow information such as packet count (for host monitoring) and type of service and protocol used (for QoS) from data records.
  • Implemented a Perl script to extract IP flow information from a temporary file and store it in a MySQL database using the Perl Database Interface (DBI).

Education

Sri Jayachamarajendra College Of Engineering

Bachelor of Engineering - BE — Computer Science
