Bushniel A.

SRE (Site Reliability Engineer)

Kitchener, Ontario, Canada · 1 yr 4 mos experience

Key Highlights

Over 10 years of experience in Site Reliability Engineering.
Expert in AWS infrastructure management and optimization.
Proven track record of enhancing observability and reliability.

Stackforce AI infers this person is a Site Reliability Engineer with extensive experience in SaaS and Fintech infrastructure management.

Skills

Core Skills

Site Reliability Engineering · AWS Infrastructure Management · Monitoring · Observability · Automation · Uptime Management · CI/CD Management · Cost Optimization · Metrics Management · Network Security · Infrastructure Management · Network Design · Reliability Engineering · Transaction Reliability · DNS Management · Server Management · Alerting Systems · Client Management

Other Skills

AWS · Amazon EKS · Amazon Web Services (AWS) · Ansible · ArgoCD · Bash · Bitbucket · Business Continuity · CI · CI/CD · Communication · Containerization · DNS · Datadog · DevOps

About

With over 10 years of industry experience, I am a skilled and certified Site Reliability Engineer who can architect, build, manage, troubleshoot, and optimize complex infrastructures hosted on the AWS cloud. I have a strong background in Linux, DevOps, and cloud platforms, and I can bridge the gap between development and operations teams. In my most recent role at Bluescape, I led the POC and production implementation of VictoriaMetrics to solve the long-term metrics storage issue with Prometheus, increasing metrics retention from 10 days to 90 days on our EKS clusters.

Experience

Metric Gaming

Senior Site Reliability Engineer

Apr 2024 – Present · 1 yr 11 mos · Working from Kitchener, Ontario, Canada - London Area, United Kingdom · Remote

As an SRE, I manage AWS EKS infrastructure using IaC (Terraform) for a high-performance betting platform whose applications are written in Go and communicate via gRPC. The stack includes Linkerd, ArgoCD, Bitbucket Pipelines, Karpenter, Datadog, MSK Kafka, Kong, and Cloudflare to ensure scalability, reliability, and observability.
Key Achievements:
Successfully migrated our monitoring stack from Prometheus/Grafana to Datadog, enabling distributed tracing, expanding observability by 60%, and reducing MTTR by 35%.
Automated Linkerd certificate rotation with cert-manager, delivering 100% uptime during renewals and saving over 20 engineer-hours per quarter.
Optimized CI/CD by implementing a Bitbucket runner autoscaler, reducing deployment time by 40% while cutting infra costs by 25%.
Led training programs for 10+ engineers, closing knowledge gaps and reducing dependency on senior staff by 30%.

Linkerd · ArgoCD · Amazon EKS · Kubernetes · Terraform · Datadog · +2

Bluescape

Senior Site Reliability Engineer

Nov 2022 – Dec 2023 · 1 yr 1 mo · Kitchener, Ontario, Canada · Remote

At Bluescape, I worked as a Site Reliability Engineer managing large-scale infrastructure hosted on AWS Cloud, including multiple AWS accounts, VPCs, Kubernetes clusters, and databases (RDS, MongoDB Atlas, DynamoDB, ElastiCache). My responsibilities spanned incident management, cost optimization, CI/CD pipelines, and security compliance, while ensuring system reliability and scalability.
Major Achievement:
Spearheaded the production deployment of VictoriaMetrics, solving Prometheus scaling limitations and extending metrics retention from 10 days to 90 days, significantly improving long-term observability.

Technical Design · Distributed Systems · Amazon Web Services (AWS) · Microservices · Incident Response · Scalability · +18

Caseware International Inc.

Cloud Infrastructure Engineer

Nov 2021 – Jul 2022 · 8 mos · Toronto, Ontario, Canada · Remote

At Caseware International, I worked as a Site Reliability Engineer managing infrastructure on AWS Cloud, supporting global engineering teams, and ensuring reliable, secure connectivity. I participated in on-call rotations to handle Jira tickets and delivered infrastructure improvements through sprint-based planning.
Key Achievements:
Designed and deployed a VPN client solution for Caseware’s engineering team in Ukraine, reducing latency and improving security when connecting to AWS and the Canadian datacenter.
Successfully re-architected the network infrastructure, eliminating single points of failure by migrating from NAT instances and OpenVPN to NAT Gateways and VPC peering, ensuring inter-region connectivity with zero downtime.

Technical Design · Distributed Systems · Amazon Web Services (AWS) · Microservices · Incident Response · Scalability · +21

Razorpay

DevOps Engineer

Jul 2017 – Sep 2021 · 4 yrs 2 mos · Bengaluru Area, India · On-site

As one of the founding members of the SRE team at Razorpay, I helped build and scale the company’s infrastructure from zero to production-grade systems handling millions of transactions. My work spanned architecting AWS-based environments, ensuring PCI compliance, and driving critical infrastructure initiatives that supported Razorpay’s rapid growth in the fintech space.
Designed, provisioned, and managed AWS-based dev/stage/prod infra, meeting the needs of both internal dev teams and external banking partners.
Led a team of L1 engineers, guiding them through sprint-based tasks and mentoring them toward infrastructure goals.
Ensured PCI compliance across infrastructure provisioning, successfully driving annual certification processes with external auditors.
Responded to incidents across Razorpay’s monitoring stack (Prometheus, Grafana, PagerDuty, SumoLogic), handling triage and RCA for production issues.
Major Achievements:
Built cross-network infra connecting Razorpay’s AWS Cloud with HDFC Bank’s bare metal servers, enabling secure VPN connectivity and two-way RDS ↔ MySQL replication that ensured 24/7 transaction reliability.
Solved a critical DNS scaling and throttling bottleneck, reducing API failures by 40% and improving uptime to 99.99%.
Identified and fixed a kube-router CNI race condition that caused repeated pod outages; the fix was shared with the open-source community.
https://github.com/cloudnativelabs/kube-router/issues/370#issuecomment-463967949
Optimized Kubernetes cluster join time, cutting node spin-up delays by 60% and reducing infra spend by ~20% monthly using spot instances.
Resolved a graceful termination issue in pods, cutting failed requests during scaling by 80% and directly reducing payment failures for customers.

Technical Design · People Management · Leadership · Distributed Systems · Microservices · Incident Response · +19

DM3 (Digital Media, Marketing & Monitoring)

IT Project Coordinator

Jul 2015 – Nov 2016 · 1 yr 4 mos · Dubai · On-site

At DM3, I managed infrastructure across internal servers and production environments hosted in SoftLayer and Telecity datacenters. I supported developers, maintained IBM Domino and local IT systems, and provided remote support across international offices (Egypt, Saudi Arabia, Europe).
Key Achievement: Migrated 100+ Linux servers to a new virtualized platform, cutting infrastructure costs by 30% and improving scalability.

Disaster Recovery · Communication · System Performance · Root Cause Analysis · Linux

Unique Computer Systems FZE, Sharjah

Linux Server Administrator

Jun 2014 – Jul 2015 · 1 yr 1 mo · Sharjah · On-site

At UCS, I worked as a System Administrator, managing Linux and Windows servers across Rackspace, Hivelocity, and Azure. I supported web, database, and mail services, while also maintaining internal IT infrastructure such as Active Directory, VPNs, firewalls, and backups.
Key Achievement: Successfully deployed UCS’s SMS alerting platform for enterprise clients, including the Dubai Ministry of Economy, where I implemented AirWatch MDM for secure device management.

Project Management · Scalability · Disaster Recovery · Communication · System Performance · Root Cause Analysis · +1

Apigee

Site Reliability Engineer (Global Operations Center - GOC)

Dec 2012 – Sep 2013 · 9 mos · Bangalore · On-site

At Apigee, I worked as a Site Reliability Engineer, managing large-scale Linux infrastructure on AWS. My responsibilities included provisioning and decommissioning servers, monitoring with Nagios/Thruk/Icinga, and automating patching and configuration with Jenkins, Puppet, and Git. I also collaborated with engineering teams to troubleshoot and resolve production issues while ensuring zero-downtime deployments.
Key Achievement: Improved service reliability by automating server provisioning and patching, reducing deployment-related incidents by ~30% and maintaining seamless customer uptime.

Incident Response · Python (Programming Language) · Scalability · Disaster Recovery · Communication · System Performance · +3

Akamai Technologies

Platform Operations Engineer

Mar 2012 – Nov 2012 · 8 mos · Bengaluru, Karnataka, India · On-site

At Akamai, I worked as a Platform Operations Engineer, managing the company’s massive Edge Caching platform of more than 120,000 Linux servers worldwide. My role included incident resolution, cross-team collaboration, and supporting product deployments while ensuring round-the-clock reliability.
Key Achievement: Helped sustain 99.99% uptime across Akamai’s global edge infrastructure, ensuring uninterrupted delivery for enterprise customers.

Incident Response · Scalability · Communication · System Performance · Root Cause Analysis · Linux

Carmatec IT Solution P Ltd.

Senior Linux Systems Administrator

Aug 2008 – Mar 2012 · 3 yrs 7 mos · Bangalore · On-site

At Carmatec, I served as a Senior Linux System Administrator and later as a Team Lead, supporting U.S.-based datacenters. I managed LAMP servers and acted as a Level 3 escalation point for complex incidents via ticketing, calls, and remote troubleshooting.
Key Achievement: Promoted to Team Lead within a short span, managing 10 engineers and improving incident resolution efficiency by ~20% through better processes and coordination.

Incident Response · Python (Programming Language) · Scalability · Communication · System Performance · System Monitoring · +2

TLI Software Pvt Ltd

Quality Analyst

Feb 2006 – Jun 2008 · 2 yrs 4 mos · Greater Bengaluru Area · On-site

At TLI, I worked as a Quality Analyst, ensuring compliance in outbound sales calls, supervising floor operations, and handling customer escalations. I also prepared sales performance reports and trained new agents in company policies and quality standards.
Key Achievement: Enhanced call compliance by 15% and reduced escalations through effective training and supervision.

People Management · Communication