Vibhav Chary

VP of Engineering

Bengaluru, Karnataka, India · 22 yrs 2 mos experience

Highly Stable · AI Enabled

Key Highlights

Achieved 99.99% infrastructure uptime.
Reduced operational costs by over 50%.
Led cross-functional teams to optimize security and performance.

Stackforce AI infers this person is a DevOps and SRE leader in the SaaS industry.

Skills

Core Skills

SRE, DevOps

Other Skills

AI/ML, AWS, AWS EKS, Amazon Web Services (AWS), Analytical Skills, Apache Mesos, Automation, Business Requirements, Capacity analysis, Capital Expenditures, Chaos engineering, Cloud Computing, Communication, Control Tower architecture, Cost reduction

About

Design, build, measure, track, and maintain an infrastructure uptime of 99.99% with minimal security vulnerabilities. Hands-on, with specialties in Kubernetes, Helm, Istio, GitLab CI/CD, Linux, observability, logging, AWS, DevOps/SRE tools and processes, incident, problem, and change management, project management, cloud migrations, cost reduction, database migrations, CloudMongo, and Kong API Gateway.
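
To make the 99.99% target concrete, an availability figure can be converted into a downtime budget. The sketch below is illustrative only: the function name and the 30-day window are assumptions, not details taken from this profile.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the minutes of downtime allowed in the window at the given availability.

    availability_pct is a percentage such as 99.99; window_days is an assumed
    reporting window (30 days here, chosen only for illustration).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The allowed downtime is the unavailable fraction of the window.
    return window_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(99.99):.2f} minutes per 30 days")
```

Under these assumptions, 99.99% leaves roughly 4.32 minutes of downtime per 30 days, or about 52.56 minutes per 365 days.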

Experience

Total Experience: 22 yrs 2 mos

Average Tenure: 6 yrs 11 mos

Current Experience: 1 yr 4 mos

FourKites, Inc.

Vice President Engineering, AI/Agentic/MCP (SRE, DevSecOps, DevOps, Platforms, Infosec, IT)

Feb 2025 – Present · 1 yr 4 mos · Bengaluru, Karnataka, India · Remote

Key Responsibilities:
Strategic Leadership & Vision: Provide visionary leadership in defining and executing a holistic platform strategy that seamlessly integrates Site Reliability Engineering (SRE), DevOps, Security, and Platform Engineering principles.
Intelligent Automation: Drive the development and adoption of advanced agentic workflows, leveraging Model Context Protocol (MCP) and sophisticated AI/ML models to create highly autonomous and intelligent operational frameworks.
Organizational Optimization: Champion initiatives focused on significantly enhancing developer productivity, optimizing system availability and resiliency, and fortifying the overall security posture across the enterprise.
Cross-Functional Integration: Oversee the strategic integration of diverse engineering tools and practices to foster a unified, efficient, and secure development and operations ecosystem.
Innovation & Future-Proofing: Act as a key driver of innovation, continuously exploring emerging technologies and methodologies to ensure our engineering platforms remain at the forefront of industry best practices and future demands.
Team Empowerment: Lead and mentor high-performing architecture and engineering teams, fostering a culture of technical excellence, collaboration, and continuous improvement.
My focus is on translating complex technical challenges into strategic opportunities, delivering solutions that not only meet immediate business needs but also establish a robust, scalable, and secure foundation for future growth.

Kubernetes, DevSecOps, SRE, Platform Engineering, AI/ML, DevOps

Niyo Solutions Inc.

3 roles

Vice President | SRE | DevSecOps | DevOps | Platforms | Cloud | Infra at NiYO

Promoted

Oct 2023 – Feb 2025 · 1 yr 4 mos

● Design Active/Active architecture entirely on GitOps, with minimal cost
● Self-manage Kafka on Kubernetes at scale, with 90% savings on the existing burn
● GitOps (successfully managed an entire bank account with read-only access to the AWS console)
● Built a custom logging solution that cut cost by about 70% ($220K to $60K for 1 TB of ingestion per day)
● Improved developer productivity by moving to the open-source Devtron platform
● Using the open-source Kubecost platform, improved Savings Plan utilisation from 60% to 90%
● Lead the development and maintenance of SRE platforms, toolsets, and infrastructure, ensuring they are robust, scalable, aligned with organizational needs, and improve developer productivity
● Drive the adoption of SRE best practices, including automation, monitoring, and incident response

GitOps, Kafka, Kubernetes, SRE best practices, Cost reduction, DevOps, +1 more

Senior Director of Engineering | SRE | DevSecOps | DevOps | Platforms | Cloud | Infra at NiYO

Aug 2021 – Oct 2023 · 2 yrs 2 mos

Improve Dev to DevOps ratio (50:1)
1. Automated transition from imported ACM certificates to Amazon-issued ACM certificates
2. Server patching and maintenance automation
3. Database updates and deletion workflow integration
4. Infrastructure as Code (IaC) with Terraform: Leveraged Terraform for Day Zero infrastructure provisioning, enabling efficient resource deployment and management.
5. EKS Cluster Upgrades: Managed Kubernetes version upgrades for EKS clusters, ensuring compatibility and security in an environment where new versions were released every three months.
6. Control Tower Architecture for Enhanced Security: Introduced a Control Tower architecture to eliminate or reduce security vulnerabilities resulting from DevOps oversights, enhancing overall system security.
7. Self-Service Handover to Development Teams: Enabled development teams to create north-south (NS) and east-west (EW) routes for new microservices independently, streamlining network management. Microservice onboarding with Helm charts.
8. Developed standardized monitoring templates covering metrics such as 5xx and 4xx responses, latencies, pod restarts, and Apdex score, empowering development teams to maintain application health.
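
As an illustration of one metric named in the monitoring templates, the Apdex score can be computed from request latencies using the standard Apdex formula. This is a minimal sketch: the function name, the 0.5-second threshold, and the sample latencies are assumptions, not values from the role described above.

```python
def apdex(latencies_s: list[float], threshold_s: float = 0.5) -> float:
    """Compute the Apdex score for a list of request latencies in seconds.

    Standard Apdex definition:
      satisfied  : latency <= T
      tolerating : T < latency <= 4T
      frustrated : latency > 4T
      score      : (satisfied + tolerating / 2) / total
    threshold_s (T) is an assumed target; pick one per service.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for t in latencies_s if t <= threshold_s)
    tolerating = sum(1 for t in latencies_s if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


if __name__ == "__main__":
    # Hypothetical sample: two fast requests, one slow, one very slow.
    print(apdex([0.1, 0.2, 0.6, 3.0]))
```

With the hypothetical sample above, two requests are satisfied, one is tolerating, and one is frustrated, so the score is (2 + 0.5) / 4 = 0.625.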

Terraform, Kubernetes, Automation, Control Tower architecture, Monitoring templates, DevOps, +1 more

Director Of Engineering SRE

Sep 2019 – Aug 2021 · 1 yr 11 mos

Achievements:
1. Moved from kOps to AWS EKS
2. UAT, Beta, and Prod all on the same version of AWS EKS and Kong
3. Database migration from CloudMongo to PCI-certified MongoDB Atlas
4. Set up two PCI-certified environments from scratch
5. Moved from New Relic to Datadog, reducing cost by 50%
6. Containerisation of Kong API gateways
7. Set up end-to-end observability from scratch (infra, application, logging, alerts, visualization, RCAs, SLOs, incidents, and major incidents)
8. DR drill compliance using chaos engineering
DevOps Operational Excellence:
Automated SSL certificate renewal on ELBs/CloudFront via CI/CD
MTTD: 5 mins
MTTR: 30 mins
Meeting developers' infra requirements within 48 hrs
Managing infrastructure as code via Terraform
◦ Kong upgrades
◦ Setting up new infra from scratch (VPC, subnets, EKS clusters, EKS nodes, security groups)
◦ Mutual SSL certificate changes
◦ AWS Systems Manager to automate patching on servers
Costs:
Reduced the AWS bill by 50% by implementing the following:
1. Savings Plan combined with reservations
2. Using spot instances in Beta/UAT environments
3. Using AMD-based processors instead of Intel
4. Judiciously utilising S3 bucket policies
5. Automated removal of unused EBS volumes
6. Covering RDS, Elasticsearch, and ElastiCache under reservations
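
The unused-EBS cleanup in item 5 could be sketched as a selection step like the one below. This is an illustration under stated assumptions, not the actual automation: the function name and the 7-day grace period are hypothetical, and the input is assumed to be volume records shaped like the `Volumes` entries returned by the boto3 EC2 client's `describe_volumes` call.

```python
from datetime import datetime, timedelta, timezone


def unused_volume_ids(volumes: list[dict], now: datetime, min_age_days: int = 7) -> list[str]:
    """Return IDs of volumes that look safe to review for deletion.

    A volume is selected when it is in the "available" state (not attached to
    any instance), has no attachments listed, and was created at least
    min_age_days ago. min_age_days is an assumed grace period.
    """
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available"
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]


if __name__ == "__main__":
    # Hypothetical records; real ones would come from describe_volumes.
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    sample = [
        {"VolumeId": "vol-a", "State": "available", "Attachments": [],
         "CreateTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"VolumeId": "vol-b", "State": "in-use", "Attachments": [{"InstanceId": "i-1"}],
         "CreateTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    ]
    print(unused_volume_ids(sample, now))
```

Keeping the selection separate from the deletion call makes it easy to run as a dry-run report before anything is removed.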
Process Improvements:
Reduced Dev–SRE interactions by 40% by creating FAQs
Automated Jira ticket creation by email to track all developer requests
SLA tracking and escalation process for all developer requirements
Vendor Negotiations:
Negotiated with Datadog to cover our app/infra monitoring at 50% of the New Relic cost
Centralising the MongoDB databases helped save 30% of the costs
Reduced our daily logging from 500 GB to 100 GB per day, which saved us 60% of the costs
Moving the NOC from a dedicated to a shared model helped us reduce costs by 60%

AWS EKS, Database Migration, Observability, Cost reduction, Chaos engineering, DevOps, +1 more

Ola (ANI Technologies Pvt. Ltd.)

Senior Engineering Manager, DevOps

May 2017 – Aug 2019 · 2 yrs 3 mos · Bengaluru Area, India

Build observability into the microservices ecosystem for tracing and debugging
Observability platform:
Monitoring: Prometheus, Sensu
Alerting/visualisation: Grafana, PagerDuty
Distributed systems tracing infrastructure using New Relic
Logging: Graylog
Build Observability on below Infra components:
Mesos Master
Mesos Slaves
Marathon
HAProxy/MLBs
Unbound
PDNS
Git
Artifactory
Mesos-ZK
Kafka
Reduce MTTD (mean time to detect) and MTTR (mean time to recover) for production issues using the observability platform
Reduce CPU, memory, and EBS costs through capacity analysis
Reduce S3 storage cost using custom Boto3 scripts and AWS Analytics
Explore open source and other options to meet our automation requirements
Taking architectural decisions for building highly available and large scale distributed systems
Kong/Repose to throttle API traffic
Hystrix for circuit breakers
HAProxy/Nginx for load balancing and routing
Read / Write Traffic on Database servers
When to use shared vs dedicated database servers
What metrics to monitor
Using a Redis cache to relieve load on databases
Experience in Sprint cycles / Planning using Jira
Interacting with Internal and External Auditors for ITGC
Auditing IAM users (active and deactivated)
Access controls to AWS Infra
Change Management Policy
Revision and approval history for CM policy.
Incident management outages
Areas of expertise:
Experience in building Platforms - Observability
Experience in Build Tools like Git, Jenkins, Artifactory
Experience in Deployment – Docker, Mesos, Marathon
Experience in Monitoring – Prometheus, Sensu, Nagios, Graphite
Experience in Log Management tools – Graylog

Observability, Microservices, Monitoring, Distributed systems, Capacity analysis, DevOps, +1 more

CSS Corp

Senior Manager

Feb 2004 – Apr 2017 · 13 yrs 2 mos · USA, Ireland, Hyderabad, Chennai, Bangalore

Clients: InMobi, Google, Argo, Netgear
Core Roles and Responsibilities:
DevOps:
Integration of alerts from New Relic, AWS CloudWatch, Prometheus, Nagios, and Sysdig to PagerDuty and Slack
L1 troubleshooting of application alerts
Production deployments using AWS OpsWorks and Elastic Beanstalk
Troubleshooting HAProxy, Nginx, MLBs
Managing AWS IAM infrastructure
Handling major incidents
Deployments across four datacenters using the InMobi Deployment Platform
Root cause analysis of major incidents and creating post mortem documents
Problem management, analyzing repetitive alerts and taking corrective action
Nagios integration with alert management system
Creating dashboards in Grafana
Configuring Nagios servers using NConf
Datacenter Operations:
Racking and stacking of servers
Managing spares across all datacenters
OS installations using PXE
iDRAC reachability troubleshooting
RAID configurations, disk and memory swapping with minimum downtime
Projects Handled:
Planning and Executing data center migration using PMP methodologies
Involved in migrating one data center in USA within two weeks (350 Servers)
Implemented SIP architecture from scratch using Asterisk
iDRAC monitoring across 4 datacenters
Setting up OME and OMPC
Monitoring OME and OMPC boxes using Graphite and Dockerised Nagios
Improvements across people, process and SLAs:
Post-mortem documentation improved from 30% to 100%. All outages tracked and action items followed up on
SLA % improved from 40% to 95% across NOC, DC Ops and Desktop Support
Number of alerts decreased by 50% through proactive problem management
Attrition kept below 10% by retaining key people in key roles. Moving people across different verticals helped keep attrition low