Bushniel A.

SRE (Site Reliability Engineer)

Kitchener, Ontario, Canada · 1 yr 4 mos experience

Key Highlights

  • Over 10 years of experience in Site Reliability Engineering.
  • Expert in AWS infrastructure management and optimization.
  • Proven track record of enhancing observability and reliability.
Stackforce AI infers this person is a Site Reliability Engineer with extensive experience in SaaS and Fintech infrastructure management.

Skills

Core Skills

Site Reliability Engineering · AWS Infrastructure Management · Monitoring · Observability · Automation · Uptime Management · CI/CD Management · Cost Optimization · Metrics Management · Network Security · Infrastructure Management · Network Design · Reliability Engineering · Transaction Reliability · DNS Management · Server Management · Alerting Systems · Client Management

Other Skills

AWS · Amazon EKS · Amazon Web Services (AWS) · Ansible · ArgoCD · Bash · Bitbucket · Business Continuity · CI · CI/CD · Communication · Containerization · DNS · Datadog · DevOps

About

With over 10 years of industry experience, I am a skilled and certified Site Reliability Engineer who can architect, build, manage, troubleshoot, and optimize complex infrastructures hosted on AWS. I have a strong background in Linux, DevOps, and cloud platforms, and I can bridge the gap between development and operations teams. In my most recent role at Bluescape, I led the POC and production implementation of VictoriaMetrics to solve the long-term metrics storage issue with Prometheus, thereby increasing metrics retention from 10 days to 90 days on our EKS clusters.

Experience

Metric Gaming

Senior Site Reliability Engineer

Apr 2024 - Present · 1 yr 11 mos · Working from Kitchener, Ontario, Canada - London Area, United Kingdom · Remote

  • As an SRE, I manage AWS EKS infrastructure with IaC (Terraform) for a high-performance betting platform whose Go applications communicate via gRPC. The stack includes Linkerd, ArgoCD, Bitbucket Pipelines, Karpenter, Datadog, MSK (Kafka), Kong, and Cloudflare, which together support scalability, reliability, and observability.
  • Key Achievements:
  • Successfully migrated our monitoring stack from Prometheus/Grafana to Datadog, enabling distributed tracing, expanding observability by 60%, and reducing MTTR by 35%.
  • Automated Linkerd certificate rotation with cert-manager, delivering 100% uptime during renewals and saving over 20 engineer-hours per quarter.
  • Optimized CI/CD by implementing a Bitbucket runner autoscaler, reducing deployment time by 40% while cutting infrastructure costs by 25%.
  • Led training programs for 10+ engineers, closing knowledge gaps and reducing dependency on senior staff by 30%.
Linkerd · ArgoCD · Amazon EKS · Kubernetes · Terraform · Datadog · +2

Bluescape

Senior Site Reliability Engineer

Nov 2022 - Dec 2023 · 1 yr 1 mo · Kitchener, Ontario, Canada · Remote

  • At Bluescape, I worked as a Site Reliability Engineer managing large-scale infrastructure hosted on AWS Cloud, including multiple AWS accounts, VPCs, Kubernetes clusters, and databases (RDS, MongoDB Atlas, DynamoDB, ElastiCache). My responsibilities spanned incident management, cost optimization, CI/CD pipelines, and security compliance, while ensuring system reliability and scalability.
  • Major Achievement:
  • Spearheaded the production deployment of VictoriaMetrics, solving Prometheus scaling limitations and extending metrics retention from 10 days to 90 days, significantly improving long-term observability.
Technical Design · Distributed Systems · Amazon Web Services (AWS) · Microservices · Incident Response · Scalability · +18

Caseware International Inc.

Cloud Infrastructure Engineer

Nov 2021 - Jul 2022 · 8 mos · Toronto, Ontario, Canada · Remote

  • At Caseware International, I worked as a Site Reliability Engineer managing infrastructure on AWS Cloud, supporting global engineering teams, and ensuring reliable, secure connectivity. I participated in on-call rotations to handle Jira tickets and delivered infrastructure improvements through sprint-based planning.
  • Key Achievements:
  • Designed and deployed a VPN client solution for Caseware’s engineering team in Ukraine, reducing latency and improving security when connecting to AWS and the Canadian datacenter.
  • Successfully re-architected the network infrastructure, eliminating single points of failure by migrating from NAT instances and OpenVPN to NAT Gateways and VPC peering, ensuring inter-region connectivity with zero downtime.
Technical Design · Distributed Systems · Amazon Web Services (AWS) · Microservices · Incident Response · Scalability · +21

Razorpay

DevOps Engineer

Jul 2017 - Sep 2021 · 4 yrs 2 mos · Bengaluru Area, India · On-site

  • As one of the founding members of the SRE team at Razorpay, I helped build and scale the company’s infrastructure from zero to production-grade systems handling millions of transactions. My work spanned architecting AWS-based environments, ensuring PCI compliance, and driving critical infrastructure initiatives that supported Razorpay’s rapid growth in the fintech space.
  • Designed, provisioned, and managed AWS-based dev/stage/prod infra, meeting the needs of both internal dev teams and external banking partners.
  • Led a team of L1 engineers, guiding them through sprint-based tasks and mentoring them toward infrastructure goals.
  • Ensured PCI compliance across infrastructure provisioning, successfully driving annual certification processes with external auditors.
  • Responded to incidents across Razorpay’s monitoring stack (Prometheus, Grafana, PagerDuty, SumoLogic), handling triage and RCA for production issues.
  • Major Achievements:
  • Built cross-network infra connecting Razorpay’s AWS Cloud with HDFC Bank’s bare metal servers, enabling secure VPN connectivity and two-way RDS ↔ MySQL replication that ensured 24/7 transaction reliability.
  • Solved critical DNS scaling and throttling bottleneck, reducing API failures by 40% and improving uptime to 99.99%.
  • Identified a kube-router CNI race condition that was causing pod outages and prevented recurrences; the fix was shared with the open-source community.
  • https://github.com/cloudnativelabs/kube-router/issues/370#issuecomment-463967949
  • Optimized Kubernetes cluster join time, cutting node spin-up delays by 60% and reducing infra spend by ~20% monthly using spot instances.
  • Resolved graceful termination issue in pods, cutting failed requests during scaling by 80% and directly reducing payment failures for customers.
Technical Design · People Management · Leadership · Distributed Systems · Microservices · Incident Response · +19

DM3 (Digital Media, Marketing & Monitoring)

IT Project Coordinator

Jul 2015 - Nov 2016 · 1 yr 4 mos · Dubai · On-site

  • At DM3, I managed infrastructure across internal servers and production environments hosted in SoftLayer and Telecity datacenters. I supported developers, maintained IBM Domino and local IT systems, and provided remote support across international offices (Egypt, Saudi Arabia, Europe).
  • Key Achievement: Migrated 100+ Linux servers to a new virtualized platform, cutting infrastructure costs by 30% and improving scalability.
Disaster Recovery · Communication · System Performance · Root Cause Analysis · Linux

Unique Computer Systems FZE, Sharjah

Linux Server Administrator

Jun 2014 - Jul 2015 · 1 yr 1 mo · Sharjah · On-site

  • At UCS, I worked as a System Administrator, managing Linux and Windows servers across Rackspace, Hivelocity, and Azure. I supported web, database, and mail services, while also maintaining internal IT infrastructure such as Active Directory, VPNs, firewalls, and backups.
  • Key Achievement: Successfully deployed UCS’s SMS alerting platform for enterprise clients, including the Dubai Ministry of Economy, where I implemented AirWatch MDM for secure device management.
Project Management · Scalability · Disaster Recovery · Communication · System Performance · Root Cause Analysis · +1

Apigee

Site Reliability Engineer (Global Operations Center - GOC)

Dec 2012 - Sep 2013 · 9 mos · Bangalore · On-site

  • At Apigee, I worked as a Site Reliability Engineer, managing large-scale Linux infrastructure on AWS. My responsibilities included provisioning and decommissioning servers, monitoring with Nagios/Thruk/Icinga, and automating patching and configuration with Jenkins, Puppet, and Git. I also collaborated with engineering teams to troubleshoot and resolve production issues while ensuring zero-downtime deployments.
  • Key Achievement: Improved service reliability by automating server provisioning and patching, reducing deployment-related incidents by ~30% and maintaining seamless customer uptime.
Incident Response · Python (Programming Language) · Scalability · Disaster Recovery · Communication · System Performance · +3

Akamai Technologies

Platform Operations Engineer

Mar 2012 - Nov 2012 · 8 mos · Bengaluru, Karnataka, India · On-site

  • At Akamai, I worked as a Platform Operations Engineer, managing the company’s massive Edge Caching platform of more than 120,000 Linux servers worldwide. My role included incident resolution, cross-team collaboration, and supporting product deployments while ensuring round-the-clock reliability.
  • Key Achievement: Helped sustain 99.99% uptime across Akamai’s global edge infrastructure, ensuring uninterrupted delivery for enterprise customers.
Incident Response · Scalability · Communication · System Performance · Root Cause Analysis · Linux

Carmatec IT Solution P Ltd.

Senior Linux Systems Administrator

Aug 2008 - Mar 2012 · 3 yrs 7 mos · Bangalore · On-site

  • At Carmatec, I served as a Senior Linux System Administrator and later as a Team Lead, supporting U.S.-based datacenters. I managed LAMP servers and acted as a Level 3 escalation point for complex incidents via ticketing, calls, and remote troubleshooting.
  • Key Achievement: Promoted to Team Lead within a short span, managing 10 engineers and improving incident resolution efficiency by ~20% through better processes and coordination.
Incident Response · Python (Programming Language) · Scalability · Communication · System Performance · System Monitoring · +2

TLI Software Pvt Ltd

Quality Analyst

Feb 2006 - Jun 2008 · 2 yrs 4 mos · Greater Bengaluru Area · On-site

  • At TLI, I worked as a Quality Analyst, ensuring compliance in outbound sales calls, supervising floor operations, and handling customer escalations. I also prepared sales performance reports and trained new agents in company policies and quality standards.
  • Key Achievement: Enhanced call compliance by 15% and reduced escalations through effective training and supervision.
People Management · Communication

Education

BTech in Computer Science

Bachelor’s Degree — Computer Science

Jan 2005 - Jan 2008
