Rajesh Kondra

SRE (Site Reliability Engineer)

Bengaluru, Karnataka, India

17 yrs 10 mos experience

Most Likely To Switch: Highly Stable

Key Highlights

Expert in AWS and cloud migration strategies.
Proven track record in site reliability engineering.
Skilled in automating processes and enhancing monitoring.

Stackforce AI infers this person is a Cloud Computing and Site Reliability Engineering expert with extensive experience in automation and monitoring.

Skills

Core Skills

Microservices, Cloud Computing, Site Reliability Engineering, DevOps, System Administration, Middleware Administration, Application Support

Other Skills

AWS, Amazon Web Services (AWS), Apache, AppDynamics, Chef, CI/CD, Computer Science, Continuous Delivery, Continuous Improvement, Continuous Integration, Continuous Integration and Continuous Delivery (CI/CD), Docker, Git, Grafana

About

AWS solution architect with a background in middleware administration, web system support, and automation in Python.

Technology Experience
Cloud: AWS, Azure
OS Environments: RHEL, Oracle Linux, Ubuntu Linux, CentOS, Solaris
Network Protocols / Remote Connections: Telnet, rlogin, RSH, SSH, RDP, VPN, VNC
Configuration Management: Chef, Ansible
CI/CD: Jenkins, CircleCI
CDN: Akamai, CloudFront
DNS: Route 53
Load Balancer: HAProxy, ELB
Web Servers: Apache, Nginx
Application Server: WebLogic, JBoss, WebSphere
SQL: Oracle DB
NoSQL DB: Cassandra, MongoDB
Container: Docker, Docker Swarm, Kubernetes
Queuing Tool: TIBCO
Remote Monitoring: Nagios, collectd, New Relic, AppD, Wily, Dynatrace, Wavefront, Grafana, etc.
Log Management: Splunk, Logz
Project Information Management: JIRA, Confluence
Version Control: Git, Perforce
Scripting / Programming: Bash, Python
Troubleshooting: nmon, strace, traceroute, nmap, netstat, svmon, iostat, mpstat, etc.

Experience

Total Experience: 17 yrs 10 mos

Average Tenure: 2 yrs 6 mos

Current Experience: 5 yrs 6 mos

Nvidia

SRE Manager

Nov 2020 – Present · 5 yrs 6 mos · Bengaluru, Karnataka, India

GeForce NOW is an online cloud gaming platform. Working on Kubernetes and setting up SRE tools and monitoring to build the best cloud gaming environment.

Microservices, Scaled Agile Framework, Computer Science, Cloud Computing

VMware

Staff SRE VMware

Nov 2019 – Nov 2020 · 1 yr · Greater Bengaluru Area

Staff Site Reliability Engineer for the VMware development platform.
Applied innovative technologies and procedures, and designed and implemented the technical architecture to provide robust and scalable solutions.
Acted as a solutions evangelist, automated key processes, and contributed to the expansion of business opportunities. Responsible for the evolution of existing processes, tools, automation, and the technology stack; managed continuous improvement processes and took care of operational aspects.
Built application monitoring using Java open-source technologies. Responsible for tracking and integration using Wavefront, setting up the L1 and L2 monitoring required for VDP applications, and handling production releases.
Enhanced customer experience by ensuring 100% resolution of issues through effective tracking, monitoring, and debugging. Generated weekly reports using these metrics.
As a key member of the VMware development platform team, worked on Kubernetes upgrades from 1.15 to 1.16 using Rundeck jobs. Handled 100+ Kubernetes clusters across different environments in multiple AWS regions. Standardized monitoring using properties alert manager, Wavefront, Grafana, logs, uptime, and Runscope.
Contributed to setting up the Kubernetes environment, CI/CD, and test cases to ensure seamless deployments. Provided proactive, end-to-end support for issues with Kubernetes environments.
Defined the infrastructure requirements and signed off on infrastructure delivery, based on the existing infrastructure and service catalogue.
Staff Site Reliability Engineer for the IoT product.
Implemented monitoring standards using Java OpenTracing, Wavefront, Logz, and uptime checks.
Worked on setting up Akamai rules and best use cases.
Set up the DR environment for production applications.

Microservices, Scaled Agile Framework, Computer Science, Site Reliability Engineering, Cloud Computing

Walmart Labs

Sr Site Reliability Engineer

Apr 2018 – Nov 2019 · 1 yr 7 mos · Bengaluru, Karnataka, India

Enhanced business opportunities by establishing a robust technical framework involving effective monitoring, issue tracking, and documenting proper runbooks in the wiki. Planned for game days and participated in go-live checklists.
Built less noisy and more meaningful alerts using Grafana, with auto-resolution using Python webhooks. Ran incident management calls, gave updates to the Sr. VP, and provided technical action plans. Kept the production applications up and running with zero downtime.
Put in place the processes and requirements for achieving full automation of platform management, managed service-specific capacity planning, and ensured achievement of the predefined KPIs/SLAs.
Consolidated multiple views into a single dashboard built using Python, performed root cause analysis, and streamlined retail operations through the use of technology. Triaged multiple issues and fixed problems at the earliest. The one-click dashboard received recognition from the VP.
Conducted stress tests, FMEA, RCA, and performance testing, and planned for capacity. Served as Incident Manager and Commander for production incidents, and handled leadership notifications with real customer impact metrics.

Microservices, Scaled Agile Framework, Computer Science, Site Reliability Engineering, Cloud Computing

Intuit

Sr DevOps

Apr 2014 – Apr 2018 · 4 yrs · Bangaon, West Bengal, India

Was part of the AWS cloud migration journey for TurboTax.
Set up next-generation monitoring tools such as New Relic, AppDynamics, Splunk, and Wavefront.
Worked on AWS (Amazon Web Services): setting up alarms, launch configurations, auto scaling, tuning, security groups, VPC, and EC2.
Worked on Python and shell scripting. Built new automations and contributed to new automation projects using Python.
Set up CloudFormation templates for the AWS cloud migration.
Created CI/CD pipelines for automated deployment.
Set up new Chef cookbooks and recipes.
Experience in tuning Linux systems for better performance.
Experience setting up and configuring Apache, WebLogic, and other middleware products.
Ability to debug issues in production.
Experience in setting up proactive monitoring, working with various monitoring tools such as Wily, AppDynamics, New Relic, and Wavefront.
Worked on production issues to triage and fix them while maintaining MTTD and MTTR metrics as per the SLA.

Microservices, Software Development Life Cycle (SDLC), Terraform, Scaled Agile Framework, Computer Science, DevOps

Tech Mahindra

Senior System Administrator

May 2011 – Apr 2014 · 2 yrs 11 mos · Bangalore India

Worked as a Web System Analyst with an Australian telecom giant, providing 16x5 support for their servers hosted on WebLogic and Apache. As an administrator, was responsible for installation, configuration, and maintenance of these servers. Resolved server-related issues within SLA and provided RCA.
Well versed in systems operations; worked effectively with infrastructure groups, application vendors, and product vendors to ensure business issues were resolved in a timely manner, providing resolution to customer issues and reducing customer service Average Handling Time (AHT).
Participated in planning, delivery, and implementation support for key product launches, as well as Christmas capacity planning and system monitoring automation.
Actively recommended and worked on system improvement initiatives that reduced incoming tickets to IT and thus had a direct impact on better customer service.
Analyzed and worked with the business to understand high-severity issues, performed root cause analysis, and rectified issues at the earliest by applying the right troubleshooting skills.
Involved in User Acceptance Testing (UAT) support, Product Construct Testing (PCT), and Business Readiness Testing (BRT) support.
Coordinated issue/defect walkthroughs with business stakeholders for the UAT phase.
Built, managed, and reviewed the knowledge base.
Responsible for managing the production recurring/known issues tracker, tracking all user-escalated issues. The responsibility included discussing and prioritizing incoming issues, daily trend reporting, and updating business stakeholders on the status of their escalated problems.
Prepared estimates and resource allocation plans for migration/upgrade projects.
Proactively identified and documented system functional and performance enhancements.
Responsible for installation, configuration, and support of middleware platforms such as WebLogic, JBoss, and Tomcat.

Computer Science, System Administration

IPsoft

Middleware Server Administration

Aug 2010 – May 2011 · 9 mos

My role includes installation, configuration, handling escalations, and shift management. We work in a shared offshore model (supporting multiple clients), supporting middleware technologies such as WebSphere, WebLogic, JBoss, and Tomcat.
Manage L2/L3 calls and escalations within the shift.
Troubleshooting and administration of application servers such as WebLogic 8.x, 9.x, and 10.
Providing RCA (root cause analysis) in case of outages.
Involved in projects for setting up JMX monitoring and creating SOPs and escalation procedures.
Perform software installation, upgrades/patches, troubleshooting, and maintenance on all WebLogic application servers.
Analyze and resolve problems on application and web servers.
Participate in root cause analysis of recurring issues, system backup, and security setup.
Day-to-day coordination with onsite team members and the datacenter team for issues, queries, and weekend activity.
Production support on a 24x7 basis, with the rest of the period covered through on-call.
Daily reporting of incidents and change management.
Management of an Apache-based secure reverse proxy.

Computer Science, Middleware Administration

Convergys

Platform support

Jul 2008 – Aug 2010 · 2 yrs 1 mo · Greater Hyderabad Area

As part of the application production support team, my primary responsibility is to ensure smooth day-to-day operations of the Mastercard production and non-production environments, a high-availability distributed application supporting more than 40 different applications.
Installing, configuring, managing, and monitoring the application environment as a web admin.
Installing, configuring, and administering web applications such as PPM (Price Plan Management & SelfCare) in WebSphere Advanced Server 3.5/4.0 on Solaris and CSM (Customer Service Management) in WebLogic application server on Solaris.
Creating and configuring proxy servers such as Apache as a plug-in.
Plan, coordinate, and execute maintenance/feature release installations, adhering to the change control and release management processes.
Performance and fine-tuning of the Tuxedo/CORBA JVM and the WebLogic server domain's application services.
Resolving production and non-production trouble reports and incident reports, conducting root cause analysis, and planning necessary process improvements.
Installing, upgrading, and maintaining patches and packages in development, testing, UAT, pre-prod, and production environments.
Troubleshooting NFS and NIS file system problems.
System/application resource monitoring and troubleshooting (such as file system space, IPC resources, CPU, and swap space).
Regular monitoring of file systems, daemons, and server transaction capacities using tools such as BMC Patrol.
Ensure backups of critical flat files are scheduled and completed.
Taking part in Change Control Board discussions, representing the SA team.
Maintenance and configuration of online and/or real-time interfaces (Transition, Lightbridge) for the application.