Vibhav Chary

VP of Engineering

Bengaluru, Karnataka, India · 22 yrs 2 mos experience
Highly Stable · AI Enabled

Key Highlights

  • Achieved 99.99% infrastructure uptime.
  • Reduced operational costs by over 50%.
  • Led cross-functional teams to optimize security and performance.
Stackforce AI infers this person is a DevOps and SRE leader in the SaaS industry.

Skills

Core Skills

SRE · DevOps

Other Skills

AI/ML · AWS · AWS EKS · Amazon Web Services (AWS) · Analytical Skills · Apache Mesos · Automation · Business Requirements · Capacity analysis · Capital Expenditures · Chaos engineering · Cloud Computing · Communication · Control Tower architecture · Cost reduction

About

Design, build, measure, track, and maintain an infrastructure uptime of 99.99% with minimal security vulnerabilities. Hands-on, with specialties in Kubernetes, Helm, Istio, GitLab CI/CD, Linux, observability, logging, AWS, DevOps/SRE tools and processes, incident, problem, and change management, project management, cloud migrations, cost reduction, database migrations, CloudMongo, and Kong API Gateway.

Experience

22 yrs 2 mos
Total Experience
6 yrs 11 mos
Average Tenure
1 yr 4 mos
Current Experience

FourKites, Inc.

Vice President Engineering, AI/Agentic/MCP (SRE, DevSecOps, DevOps, Platforms, Infosec, IT)

Feb 2025 – Present · 1 yr 4 mos · Bengaluru, Karnataka, India · Remote

  • Key Responsibilities:
  • Strategic Leadership & Vision: Provide visionary leadership in defining and executing a holistic platform strategy that seamlessly integrates Site Reliability Engineering (SRE), DevOps, Security, and Platform Engineering principles.
  • Intelligent Automation: Drive the development and adoption of advanced agentic workflows, leveraging Model Context Protocol (MCP) and sophisticated AI/ML models to create highly autonomous and intelligent operational frameworks.
  • Organizational Optimization: Champion initiatives focused on significantly enhancing developer productivity, optimizing system availability and resiliency, and fortifying the overall security posture across the enterprise.
  • Cross-Functional Integration: Oversee the strategic integration of diverse engineering tools and practices to foster a unified, efficient, and secure development and operations ecosystem.
  • Innovation & Future-Proofing: Act as a key driver of innovation, continuously exploring emerging technologies and methodologies to ensure our engineering platforms remain at the forefront of industry best practices and future demands.
  • Team Empowerment: Lead and mentor high-performing architecture and engineering teams, fostering a culture of technical excellence, collaboration, and continuous improvement.
  • My focus is on translating complex technical challenges into strategic opportunities, delivering solutions that not only meet immediate business needs but also establish a robust, scalable, and secure foundation for future growth.
Kubernetes · DevSecOps · SRE · Platform Engineering · AI/ML · DevOps

Niyo Solutions Inc.

3 roles

Vice President | SRE | DevSecOps | DevOps | Platforms | Cloud | Infra at NiYO

Promoted

Oct 2023 – Feb 2025 · 1 yr 4 mos

  • Designed an Active/Active architecture managed entirely through GitOps, at minimal cost.
  • Self-managed Kafka on Kubernetes at scale, saving 90% on the existing burn.
  • GitOps: successfully managed an entire bank account with read-only access to the AWS console.
  • Built a custom logging solution that cut costs by roughly 70% (from $220K to $60K for 1 TB of ingestion per day).
  • Improved developer productivity by moving to the open-source Devtron platform.
  • Using the open-source Kubecost platform, improved Savings Plan utilisation from 60% to 90%.
  • Led the development and maintenance of SRE platforms, toolsets, and infrastructure, ensuring they are robust, scalable, aligned with organizational needs, and improve developer productivity.
  • Drove the adoption of SRE best practices, including automation, monitoring, and incident response.
GitOps · Kafka · Kubernetes · SRE best practices · Cost reduction · DevOps · +1

Senior Director of Engineering | SRE | DevSecOps | DevOps | Platforms | Cloud | Infra at NiYO

Aug 2021 – Oct 2023 · 2 yrs 2 mos

  • Improved the Dev-to-DevOps ratio (50:1).
  • 1. Automated the transition from imported ACM certificates to Amazon-issued ACM certificates.
  • 2. Server patching and maintenance automation.
  • 3. Database updates and deletion workflow integration.
  • 4. Infrastructure as Code (IaC) with Terraform: Leveraged Terraform for Day Zero infrastructure provisioning, enabling efficient resource deployment and management.
  • 5. EKS cluster upgrades: Managed Kubernetes version upgrades for EKS clusters, ensuring compatibility and security in an environment where new versions were released every three months.
  • 6. Control Tower architecture for enhanced security: Introduced a Control Tower architecture to reduce security vulnerabilities resulting from DevOps oversights, enhancing overall system security.
  • 7. Self-service handover to development teams: Enabled development teams to create NS and EW routes for new microservices independently, streamlining network management.
  • Microservice onboarding with Helm charts.
  • 8. Developed standardized monitoring templates covering metrics such as 5xx and 4xx responses, latencies, pod restarts, and Apdex score, empowering development teams to maintain application health.
Terraform · Kubernetes · Automation · Control Tower architecture · Monitoring templates · DevOps · +1

Director of Engineering, SRE

Sep 2019 – Aug 2021 · 1 yr 11 mos

  • Achievements:
  • 1. Moving from Kops to AWS EKS
  • 2. UAT, Beta, and Prod all on the same version of AWS EKS and Kong
  • 3. Database migration from CloudMongo to PCI-certified MongoDB Atlas
  • 4. Set up two PCI-certified environments from scratch
  • 5. Moving from New Relic to Datadog, reducing cost by 50%
  • 6. Containerisation of Kong API Gateways
  • 7. Setting up end-to-end observability from scratch (infra, application, logging, alerts, visualization, RCAs, SLOs, incidents, and major incidents)
  • 8. DR drill compliance using chaos engineering
  • DevOps Operational Excellence:
  • Automated SSL certificate renewal on ELBs/CloudFront via CI/CD
  • MTTD: 5 mins
  • MTTR: 30 mins
  • Meeting developers' infra requirements within 48 hours
  • Managing infrastructure as code via Terraform
  • ◦ Kong upgrades
  • ◦ Setting up new infra from scratch (VPC, subnets, EKS clusters, EKS nodes, security groups)
  • ◦ Mutual SSL certificate changes
  • ◦ AWS Systems Manager to automate patching on servers
  • Costs:
  • Reduced the AWS bill by 50% by implementing the following:
  • 1. Savings Plan combined with reservations
  • 2. Using spot instances in Beta/UAT environments
  • 3. Using AMD-based processors instead of Intel
  • 4. Judiciously utilising S3 bucket policies
  • 5. Automated removal of unused EBS volumes
  • 6. Covering RDS, Elasticsearch, and ElastiCache under reservations
  • Process Improvements:
  • Reduced Dev-SRE interactions by 40% by creating FAQs
  • Automated Jira ticket creation by email to track all developer requests
  • SLA tracking and escalation process for all developer requirements
  • Vendor Negotiations:
  • Negotiated with Datadog to cover our app/infra monitoring at 50% of the New Relic cost
  • Centralising the MongoDB databases helped save 30% of the costs
  • Reduced our daily logging from 500 GB to 100 GB per day, which saved us 60% of the costs
  • Moving the NOC from a dedicated to a shared model helped us reduce costs by 60%
AWS EKS · Database Migration · Observability · Cost reduction · Chaos engineering · DevOps · +1

Ola (ANI Technologies Pvt. Ltd.)

Senior Engineering Manager, DevOps

May 2017 – Aug 2019 · 2 yrs 3 mos · Bengaluru Area, India

  • Build observability into the microservices ecosystem for tracing and debugging
  • Observability platform:
  • Monitoring: Prometheus, Sensu
  • Alerting/visualisation: Grafana, PagerDuty
  • Distributed systems tracing infrastructure using New Relic
  • Logging: Graylog
  • Build observability on the following infra components:
  • Mesos Master
  • Mesos Slaves
  • Marathon
  • HAProxy/MLBs
  • Unbound
  • PDNS
  • Git
  • Artifactory
  • Mesos-ZK
  • Kafka
  • Reduce MTTD (mean time to detect) and MTTR (mean time to recover) for production issues using the observability platform
  • Reduce CPU, memory, and EBS costs by analysing capacity
  • Reduce S3 storage cost using custom Boto3 scripts and AWS Analytics
  • Explore open source and other options to meet our automation requirements
  • Taking architectural decisions for building highly available, large-scale distributed systems
  • Kong/Repose to throttle API traffic
  • Hystrix for circuit breakers
  • HAProxy/Nginx for load balancing/routing
  • Read/write traffic on database servers
  • When to use shared vs dedicated database servers
  • What metrics to monitor
  • Using Redis cache to relieve load on databases
  • Experience in Sprint cycles / Planning using Jira
  • Interacting with Internal and External Auditors for ITGC
  • Auditing IAM Users ( Active and De-Activated Users)
  • Access controls to AWS Infra
  • Change Management Policy
  • Revision and approval history for CM policy.
  • Incident management outages
  • Areas of expertise:
  • Experience in building platforms - Observability
  • Experience in build tools like Git, Jenkins, Artifactory
  • Experience in deployment - Docker, Mesos, Marathon
  • Experience in monitoring - Prometheus, Sensu, Nagios, Graphite
  • Experience in log management tools - Graylog
Observability · Microservices · Monitoring · Distributed systems · Capacity analysis · DevOps · +1

CSS Corp

Senior Manager

Feb 2004 – Apr 2017 · 13 yrs 2 mos · USA, Ireland, Hyderabad, Chennai, Bangalore

  • Clients: InMobi, Google, Argo, Netgear
  • Core Roles and Responsibilities:
  • DevOps:
  • Integration of alerts from New Relic, AWS CloudWatch, Prometheus, Nagios, and Sysdig into PagerDuty and Slack
  • L1 troubleshooting of application alerts
  • Production deployments using AWS OpsWorks and Elastic Beanstalk
  • Troubleshooting HAProxy, Nginx, MLBs
  • Managing IAM for AWS infrastructure
  • Handling major incidents
  • Deployments across four data centers using the InMobi Deployment Platform
  • Root cause analysis of major incidents and creating post-mortem documents
  • Problem management: analyzing repetitive alerts and taking corrective action
  • Nagios integration with the alert management system
  • Creating dashboards in Grafana
  • Configuring Nagios servers using NConf
  • Data Center Operations:
  • Racking and stacking of servers
  • Managing spares across all data centers
  • OS installations using PXE
  • iDRAC reachability troubleshooting
  • RAID configurations, disk and memory swapping with minimum downtime
  • Projects Handled:
  • Planning and executing data center migration using PMP methodologies
  • Involved in migrating one data center in the USA within two weeks (350 servers)
  • Implemented SIP architecture from scratch using Asterisk
  • iDRAC monitoring across 4 data centers
  • Setting up OME and OMPC
  • Monitoring OME and OMPC boxes using Graphite and Dockerised Nagios
  • Improvements across people, process, and SLAs:
  • Post-mortem documentation coverage improved from 30% to 100%. All outages were tracked and action items followed up
  • SLA compliance improved from 40% to 95% across NOC, DC Ops, and Desktop Support
  • Number of alerts decreased by 50% through proactive problem management
  • Attrition kept below 10% by retaining key people in key roles. Moving people across different verticals helped keep attrition low
AWS · Nagios · Graphite · Data center operations · Incident management · DevOps · +1

Education

National Institute of Technology Durgapur

EE — Electrical

Jan 1999 – Jan 2003

Akkamahadevi Vidya Samste, Bhadravathi, Karnataka

PUC

Jun 1998 – Apr 1999

St. Charle's Degree College, Bhadravathi, Karnataka

10th

Apr 1995 – Apr 1996
