Neeraj Tiwari

SRE (Site Reliability Engineer)

Berlin, Germany · 15 yrs 1 mo experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Expert in Kubernetes and Docker for cloud infrastructure.
  • Proven track record in implementing SRE practices.
  • Strong experience in observability and automation tools.
Stackforce AI infers this person is a Site Reliability Engineer specializing in cloud infrastructure and automation in the SaaS industry.

Skills

Core Skills

Site Reliability Engineering · Cloud Computing · Infrastructure Automation · System Administration · Operations

Other Skills

Agile Environment · Agile Methodologies · Algorithms · Amazon Web Services (AWS) · Analytical Skills · Ansible · Apache Druid · Argo · Bash · Capacity Planning · Center of Excellence · Clean Coding · Cloud Infrastructure · Cloud Migration · Cloud Operations

About

🚀 Seasoned Site Reliability Engineer | DevOps Expert | Cloud Enthusiast

I have been in the tech industry for 13 years and bring a wealth of experience across various roles and dimensions, including SRE, DevOps, system administration, and development. My journey has been marked by delivering exceptional results in designing and optimizing complex infrastructure solutions, implementing SRE practices in microservices from the ground up, and leveraging cloud technologies to drive innovation.

Career Highlights:

  • In-depth knowledge of Kubernetes, Docker, and containerization, covering runtime, base images, and scaling best practices.
  • Proven track record of designing and implementing scalable, highly available architectures on AWS, optimizing for security, cost efficiency, and performance.
  • Familiar with SRE principles and how to roll out SRE from the ground up in an organization.
  • Strong experience in observability practices, including OpenTelemetry, instrumentation, and the creation and management of SLOs and SLIs.
  • Proficient in Python, with working knowledge of TypeScript and Go, focused on developing reliable and automated solutions.
  • In-depth knowledge of cloud-native CI/CD practices and tools, driving seamless software delivery through end-to-end automation.
  • Successfully led and delivered multiple end-to-end projects, including production load testing and SLO implementation, driving reliability across organizations.
  • Expertise in leveraging the open-source tooling ecosystem to enhance operational efficiency.
  • Proven experience with automation tools such as Terraform, Ansible, Jenkins, Puppet, ArgoCD, and CloudFormation.
  • Experience in managing bare metal infrastructure and overseeing data center operations rollout.
  • More than 6 years of experience in Linux system administration, specializing in troubleshooting, maintenance, and upgrades.

Experience

Total Experience: 15 yrs 1 mo
Average Tenure: 1 yr 10 mos
Current Experience: 3 yrs 7 mos

Celonis

Senior Site Reliability Engineer

Nov 2022 – Present · 3 yrs 7 mos · Berlin Metropolitan Area · Remote

  • Led multiple organization-wide projects, including developing an integrated engineering metrics system for enhanced collaboration and data-driven decisions, implementing Kubernetes best practices across applications, and revamping the RabbitMQ setup, reducing incidents by 70%.
  • Spearheaded production readiness reviews and enhanced observability principles such as SLOs for critical user journeys, improving reliability and customer satisfaction.
  • Actively participated in architecture discussions while enhancing the stability of a newly acquired product, reducing issues by 30%.
  • Streamlined the incident management process, significantly lowering MTTR through well-defined roles and responsibilities.
  • Established an SRE engagement model to promote SRE practices within engineering teams.
Site Reliability Engineering · Observability · Public Cloud · TypeScript · Cloud Computing · REST APIs · +20

Zalando SE

Senior SRE/DevOps Engineer

Nov 2020 – Nov 2022 · 2 yrs · Dortmund, North Rhine-Westphalia, Germany · On-site

  • Supported development teams in migrating to Kubernetes stack, providing valuable feedback on designs and architecture.
  • Proposed and implemented an innovative solution for AWS cost reporting using Python Flask.
  • Reduced manual effort by 1 hour by developing a solution to pre-scale AWS ECS tasks based on load test metrics.
Site Reliability Engineering · Observability · Public Cloud · Cloud Computing · REST APIs · Problem Solving · +21

Airtel

Lead Site Reliability Engineer

Jan 2019 – Nov 2020 · 1 yr 10 mos · Noida

  • Led the establishment and development of a high-performing team of site reliability engineers at Airtel in Noida.
  • Collaborated with cross-functional teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Defined new incident response policies, resulting in a 40% reduction in MTTR.
  • Successfully transitioned a mission-critical application from a third-party vendor, reducing dependency by 60%.
Site Reliability Engineering · Apache Druid · Public Cloud · Agile Methodologies · Cloud Computing · Data Centers · +23

Paytm

Sr. DevOps Engineer

Aug 2016 – Dec 2018 · 2 yrs 4 mos · Noida

  • Leveraged AWS services such as EC2, S3, RDS, and Lambda to architect and deploy scalable and highly available cloud infrastructure, ensuring optimal performance and cost efficiency.
  • Developed Infrastructure as Code (IaC) templates using Terraform and AWS CloudFormation, allowing for consistent and repeatable provisioning of resources.
  • Implemented robust monitoring and logging solutions with CloudWatch and the ELK Stack to proactively identify and resolve performance issues and optimize system performance.
Public Cloud · Cloud Computing · Problem Solving · Data Centers · Operations · Continuous Integration and Continuous Delivery (CI/CD) · +13

GlobalLogic

DevOps Consultant

Nov 2015 – Aug 2016 · 9 mos

  • Designed and implemented automated CI/CD pipelines using Jenkins, GitLab CI/CD, and AWS CodePipeline for seamless software delivery.
  • Developed Infrastructure as Code templates using Terraform and AWS CloudFormation for consistent resource provisioning.
  • Collaborated with development teams to optimize build pipelines and enhance software delivery efficiency.
Public Cloud · Bash · Problem Solving · Operations · Continuous Integration and Continuous Delivery (CI/CD) · Communication · +11

STMicroelectronics

Sr System Engineer

Aug 2014 – Nov 2015 · 1 yr 3 mos · Greater Noida

  • Worked as a system administrator, managing a large fleet of Linux servers and multiple mission-critical applications, and provided production support in a 24/7 environment. Also automated multiple manual tasks using shell scripts and Python.
Data Centers · Operations · Infrastructure Automation · Shell Scripting · Tomcat · Zabbix · +5

Sopra Steria

Support Engineer

Sep 2013 – Aug 2014 · 11 mos · Noida Area, India

  • Provided production support for Airbus's critical server infrastructure using Linux, shell scripting, JBoss, and Apache Tomcat.
  • Collaborated with cross-functional teams to troubleshoot and resolve server issues, ensuring minimal downtime.
  • Implemented proactive monitoring solutions to prevent potential system failures and optimize performance.
  • Streamlined processes, resulting in a 15% increase in system efficiency and stability.
Agile Environment · Operations · Shell Scripting · Tomcat · Troubleshooting · Linux · +1

Tech Mahindra

Associate System Engineer

Apr 2011 – Sep 2013 · 2 yrs 5 mos

Problem Solving · Operations · Communication · HP-UX · Troubleshooting · IBM Mainframe · +1

Education

Punjab Technical University

Master of Computer Applications - MCA — Computer Science

Jan 2011 – Jan 2013

Makhanlal Chaturvedi National University of Journalism and Communication, Bhopal

Bachelor of Computer Applications — Computers

Jan 2007 – Jan 2010

Punjab Board

10+2 — Science

Jan 2004 – Jan 2006
