Zahir Ali CP

DevOps Engineer

Bangalore Urban, Karnataka, India · 20 yrs experience

Most Likely To Switch · Highly Stable

Key Highlights

Over 15 years of experience in DevOps and Cloud Engineering.
Expert in AWS and Kubernetes for resilient system design.
Strong SRE mindset with a focus on performance optimization.

Stackforce AI infers this person is a Cloud Infrastructure Engineer with a strong focus on DevOps and Site Reliability Engineering.

Skills

Core Skills

AWS, Kubernetes, Observability, Site Reliability Engineering, Infrastructure Management, Infrastructure Automation, System Administration, Network Management, Monitoring And Troubleshooting

Other Skills

Alert Workflows, Ansible, Apache, Apache Kafka, Apache ZooKeeper, Auto Scaling, Automation, Cassandra, Chef, Cloud Computing, Containerization, Couchbase, Data Center, Databases, DevOps

About

I’m a senior DevOps and Cloud Platform Engineer with over 15 years of experience driving automation, scalability, and observability across mission-critical infrastructures. I specialize in designing resilient systems on AWS and Kubernetes (EKS), building multi-cluster observability platforms with Prometheus & Grafana, and implementing infrastructure as code using Terraform and Ansible. Throughout my career, I’ve worked across highly regulated and complex environments, integrating FIPS-compliant systems, managing secure TLS/HAProxy/CloudFront architectures, and supporting high-scale deployments of data systems such as Cassandra, Kafka, and MongoDB. I bring a strong SRE mindset to every project, combining deep technical know-how with a passion for performance optimization, root cause analysis, and proactive incident reduction.

Experience

Cisco

Technical Lead, Cloud Operations

Jun 2019 – Present · 6 yrs 9 mos · Bangalore

Leading cloud infrastructure initiatives on AWS with a strong focus on automation, observability, and system reliability. Successfully executed a range of high-impact projects, from data center build-outs to large-scale container orchestration, tailored for regulated environments like FedRAMP.
🔧 Key Responsibilities & Projects:
Designed and led the Colocation (Colo) build-out for a FedRAMP-compliant environment, including secure networking, IAM architecture, and logging compliance.
Co-architected and executed the migration of EC2-based workloads to Kubernetes (EKS), including containerization strategies, StatefulSet orchestration, and persistent volume planning.
Built robust Cassandra cluster management solutions, including repair orchestration, token-aware backup handling, and heap tuning.
Developed end-to-end observability stacks using Prometheus, Grafana, Loki, and Fluent Bit, with custom alert workflows via Webex.
Automated infrastructure provisioning and environment setup using Ansible, Terraform, and AWS SSM-based deployments.
Created internal tools and scripts in Python and Bash for everything from automated dumps to cost analysis and health checks.
Ensured infrastructure met security compliance standards, including FIPS, TLS/SSL, and role-based access controls.

AWS, Automation, Observability, Kubernetes, Terraform, Ansible, +3

Site Reliability Engineer

Feb 2014 – May 2019 · 5 yrs 3 mos · Bangalore

Worked on the central SRE team, which took escalations from the NOC and mediated across the different SRE verticals.
Worked on the release team, which ensured the latest code was pushed out to production.
Worked on the production team, which was responsible for driving major incidents, and built tools and vetted processes to bring MTTD and MTTR down. Also assisted the corresponding SRE teams in finding the root cause of major production issues.
Built backend and frontend-facing tools using Python Flask and JavaScript; read and understood Java code in production.
Currently on the Data SRE team, which handles Oracle/Couchbase infrastructure at LinkedIn, building tools and automation around the Oracle infrastructure.

SRE, Python, JavaScript, Oracle, Couchbase, Site Reliability Engineering

[24]7

Assistant Manager

Oct 2012 – Jan 2014 · 1 yr 3 mos · Bangalore

Handling the entire infrastructure for ilabs, which powers high-end data-gathering applications based on user interaction on websites, hosted in Amazon and SoftLayer.
Defining, planning, and setting up monitoring using open-source tools like Nagios and Ganglia.
Automating using Chef.
Configuring load balancers like NetScaler, Amazon ELB, and HAProxy.
Application/system debugging using various tools like iostat, vmstat, jquery JMX.
Managing application and web servers like Apache, Tomcat, and Jetty.
Managing databases on both MySQL servers and RDS hosts in Amazon.
Troubleshooting application issues and escalating to developers.
Attending post-mortem meetings to discuss various issues within the organization. Also responsible for drafting RCAs for issues that fall under the realm of operations.
Leveraging the services provided by Amazon, such as RDS, ELB, EMR, and VPC.

Nagios, Chef, Load Balancers, MySQL, RDS, Infrastructure Management

July Systems

2 roles

Senior Lead Operations

Promoted

Jun 2011 – Oct 2012 · 1 yr 4 mos · Bangalore

Taking a leading role in automating infrastructure-related activities. Writing custom scripts as required and guiding other team members in their scripting tasks.
Moving our entire application deployment onto a configuration management system (Chef). Writing recipes and grouping them into roles for the various applications.
Configuring and maintaining load balancers.
Configured Cacti to plot trends for customer services using HAProxy stats and a custom Perl data-gathering script.
Evaluating all AWS services on offer and integrating them into the infrastructure where required. Implemented Auto Scaling to ramp up servers when load on a particular application goes up. Implemented various AWS products such as RDS, ELB, Auto Scaling groups, VPC, and CloudWatch.
Backend infrastructure database to help keep costs at a minimum and to give quick guidelines on the infrastructure.
Troubleshooting application issues and escalating to developers.
Attending post-mortem meetings to discuss various issues within the organization. Also responsible for drafting RCAs for issues that fall under the realm of operations.
Providing solutions on an ad hoc basis for integrating our infrastructure with third-party providers and customers.
Troubleshooting and acting as the point of escalation for the various open-source services used within our network, such as NFS, SSH, SFTP, MySQL, Apache, and Tomcat.

AWS, Chef, Load Balancers, Auto Scaling, Infrastructure Automation

NOC Engineer

Jul 2009 – Apr 2010 · 9 mos

Configuring and maintaining the virtual network using Amazon's EC2 service. Also maintaining the servers in the local data center.
Configuring and maintaining the HAProxy load balancer.
Using shell scripts to help scale up the network on Amazon's EC2 service, making it easy to handle existing services and add new ones.
Configuring and maintaining the MySQL database server. Configuring replication when needed.
Configuring and maintaining Apache servers. Also maintaining the Tomcat servers on which the main application runs.
Configuring and maintaining the Nagios monitoring application, set up to monitor applications and servers in both the virtual cloud and the local data center (testing setup).
Applying patches provided by the QA department to live services. Updating the patches in a common location, from which they are picked up when launching portals.
Writing shell scripts as and when required.
Creating additional infrastructure to handle load or spikes that could affect a particular service and cause downtime for it.

EC2, MySQL, Apache, Network Management

AOL

Associate System Administrator

Apr 2010 – Jul 2011 · 1 yr 3 mos

Working as a system administrator for Billing Operations, which covers all the back-end processes of AOL's billing system.
Taking up on-call rotation as the primary point of contact during US late hours, ensuring the smooth running of batch jobs, Online Complex, and other billing-related systems.
Installing application code updates for billing applications and working with developers on fixing issues arising from them.
Writing scripts to automate the install process as and when applicable.
Writing scripts to monitor special instances which are not covered by the corporate systems.
Updating the Nagios configuration whenever a new service is added to the billing complex.

System Administration, Scripting, Nagios

Glowtouch

Remote NOC

Jan 2006 – Jun 2009 · 3 yrs 5 mos

Monitoring and troubleshooting the health of corporate servers, in addition to shared hosting and virtual dedicated servers, using a combination of Nagios, Mart, and an in-house incident management system. These servers provide service to around 700,000 customers.
Monitoring and troubleshooting basic applications like Apache and MySQL, as well as email issues. Also making configuration changes if required.
Maintaining a line of communication between engineers, developers, management, and the various support teams to report issues and/or deployments.
Troubleshooting and fixing issues on the Apache boxes that serve the websites of over 700,000 customers.
Detecting any DDoS attacks on the network.
Troubleshooting any email issues that might occur on the farm. Maintaining the email queue and looking out for spam.
Troubleshooting container issues for customers who buy the professional service offered by our company. Some tasks include installing the Plesk interface, setting up DNS servers, and anything else the customer might have asked for as a professional service.
Resolving tickets in the NOC pools that are escalated from Tier 3, or escalating them to the system administrators in Boston.

Nagios, Apache, MySQL, Monitoring and Troubleshooting