Arvind Lavania

DevOps Engineer

Noida, Uttar Pradesh, India · 16 yrs 7 mos experience

Key Highlights

Led the migration of Huawei Mobile Services in a record 72 days.
Architected world's largest CDN handling 1.9B visits monthly.
Achieved 50% cost optimization migrating microservices to Kubernetes.

Stackforce AI infers this person is a Cloud Infrastructure Architect with expertise in Site Reliability Engineering and SaaS solutions.

Skills

Core Skills

Site Reliability Engineering · Cloud Strategy Development · Cloud Migration Planning & Execution · Cloud Solution Architecture · Cloud Security

Other Skills

Server Architecture · Cross-team Collaboration · Continuous Process Improvement · Serverless Computing · Scalability & Performance Optimisation · Scalability & Availability · Infrastructure Reliability · Amazon Web Services (AWS) · Prometheus · Grafana · Kubernetes · Continuous Integration · Agile Methodologies · Project Management · Complex Systems

About

SRE Manager with 15+ years of experience in Site Reliability Engineering and Cloud Solution Architecture. Demonstrated expertise in cloud-native and data center infrastructure development, managing SLAs and SLOs for enterprise-grade services and products, leading large engineering teams, and developing applications in Python and Java. Proficient in implementing scalable cloud solutions to enhance system reliability and performance. Successfully architected and deployed cloud infrastructure to support Huawei's third-largest mobile service globally, serving over 2 billion active users. Leading two to three Cloud/SRE team OKRs at the organization level. Demonstrated ability to define cloud strategies, architect scalable and resilient cloud-native applications, and steer successful cloud migration initiatives. Designed and configured large-scale distributed systems and petabyte-scale data delivery applications: spearheaded the design of one of the world's largest Content Delivery Networks (CDNs) at Huawei, completed the migration of the Adobe Photoshop product from on-premises to Azure, and engineered modern cloud solutions for Expedia's cloud infrastructure. Earned the CEO Award for completing the Huawei Mobile Services migration in a record 72 days. Excel at designing and delivering end-to-end cloud solutions that meet the unique needs of diverse business environments.

Experience

16 yrs 7 mos

Total Experience

2 yrs 11 mos

Average Tenure

1 yr 8 mos

Current Experience

Greenlight

Staff Architect

Sep 2024 – Present · 1 yr 8 mos · Hybrid

Thrasio

Site Reliability Engineering Manager

Aug 2022 – Jul 2024 · 1 yr 11 mos · Noida, Uttar Pradesh, India · Remote

Supervised and mentored a team of Software and Systems Reliability Engineers to ensure high uptime and end-to-end availability of critical services. Designed and oversaw the development of software systems, meeting both functional and non-functional requirements such as scalability, reliability, and security. Provided technical leadership and guidance to development teams, facilitating architectural decisions and solving complex technical challenges. Directed a team building secure cloud services with high availability, reliability, and security for Thrasio customers.
● Designed and maintained highly distributed infrastructure in AWS and Azure with 200+ microservices and 11+ monolithic services with high concurrent RPM.
● Led a team of 15 SREs, driving the adoption of SRE practices such as chaos engineering, SLOs, error budgets, and disaster recovery strategies.
● Developed and managed CI/CD pipelines, automated testing frameworks, and capacity planning processes, resulting in a 40% increase in deployment speed and system reliability.
● Spearheaded a cross-functional initiative to implement a unified monitoring and alerting system using Prometheus and Grafana, improving incident response times by 30%.
● Collaborated with Security, Architecture, Operations, and Product Management teams to ensure seamless service delivery and high availability of critical applications.
● Migrated over 22 microservices from EC2 to Kubernetes, achieving 50% cost optimization.
● Managed 400+ Kubernetes clusters with in-house operators developed by R&D. Configured Karpenter to reduce Kubernetes costs by 22%.
● Drove the SRE organization's Security-2024 OKR, implementing multiple solutions in support of it.
● Designed and implemented monitoring and logging solutions with ELK Stack and Prometheus, leading to improved SLA and faster issue resolution.
● Conducted performance tuning and optimization of critical applications, resulting in a 25% increase in application performance.

Cloud Strategy Development · Server Architecture · Cross-team Collaboration · Continuous Process Improvement · Serverless Computing · Scalability & Performance Optimisation · +5

Huawei

Principal Cloud Architect

Aug 2020 – Aug 2022 · 2 yrs · Noida, Uttar Pradesh, India · Hybrid

Architected and refactored payment systems in the AWS cloud environment. Collaborated with internal and external stakeholders across multiple levels to align technological solutions with business objectives. Served as a technology strategist, adept at integrating business and technical strategies. Led a 14-member SRE team for the Payment Business Unit of Huawei Mobile Services.
● Optimised system performance by utilizing various migration techniques such as rehosting, refactoring, re-architecting, re-platforming, and repurchasing.
● Successfully led the design and implementation of one of the world's largest CDNs at Huawei, handling 1.9B visits per month.
● Successfully implemented complex solutions to serve 2 billion active users per month on AWS, Azure, and GCP, demonstrating deep expertise in modern cloud architecture.
● Transformed the team by introducing SRE principles, leading to a 25% reduction in operational toil and improved system resilience.
● Orchestrated the live migration of Huawei App Gallery, Photo, App Store, Weather, and Huawei Pay from AWS to Huawei Cloud, handling 600M visits per month.
● Decreased deployment time by 60% by transitioning components from IP-based configuration to service discovery using Netflix’s Eureka.
● Migrated hundreds of terabytes of data and billions of objects from S3 to Huawei Storage while maintaining service availability.
● Conducted root cause analysis and promoted self-healing systems, reducing mean time to recovery (MTTR) by 20%.
● Managed the migration of on-premises infrastructure to GCP, leveraging containerization technologies to ensure scalability and reliability.

Continuous Integration · Site Reliability Engineering · Agile Methodologies · Project Management · Complex Systems · Cloud Security · +30

Adobe

Lead Site Reliability Engineer

Apr 2018 – Aug 2020 · 2 yrs 4 mos · Noida, Uttar Pradesh, India

Mentored infrastructure and operations personnel globally, ensuring consistent delivery and availability of infrastructure services across all Adobe regions. Designed reliable systems to operate efficiently across multinational data centres. Operated within an Agile development environment, continuously evaluating and enhancing engineering processes. Introduced infrastructure solutions to manage multi-cloud and in-house infrastructure tools. Liaised with cross-functional teams to address engineering, scalability, and performance needs throughout all phases, encompassing requirements, development, testing, and launch and release.
● Managed Kubernetes and Docker environments for various microservices, overseeing 80+ Kubernetes clusters and implementing scaling strategies to grow the user base from 80K to 300K, enhancing user experience flow.
● Developed and maintained web applications using Java, Python, and JavaScript, ensuring high-quality and scalable code.
● Led the Chrome extension project for the CloudTeam, successfully increasing daily retention rates from 18% to 40%.
● Headed all aspects of production security, patch management, release impact assessments, backup planning, and network planning.
● Engineered a highly scalable multi-cloud framework, deploying hundreds of stacks per minute across AWS and Azure.
● Developed and implemented software to enhance the stability, scalability, availability, and latency of Adobe products.
● Documented best practices and architectural references for product development and deployment.
● Collaborated with cross-functional teams to integrate DevOps best practices, significantly improving software delivery processes.

Continuous Integration · Site Reliability Engineering · Agile Methodologies · Project Management · Complex Systems · Cloud Security · +34

Expedia Group

Application Engineer 2

Jun 2015 – Mar 2018 · 2 yrs 9 mos · Gurugram, Haryana, India · Hybrid

Partnered with the Security Team to implement comprehensive security measures for all infrastructure components. Collaborated with internal cloud experts to share ideas and best practices. Co-ordinated with on-site engineers and participated in triage calls to address critical system issues. Led a team to identify and resolve critical system issues impacting customers and revenue. Worked with architects, product management, and engineering teams to develop solutions that enhance platform value.
● Architected secure AWS cloud services, ensuring high availability, reliability, and security for Expedia customers and their assets.
● Established structure and organization of systems, processes, and personnel to ensure 99.99% SLA compliance for the marketplace.
● Managed P1/P2 incident tracking and improved technical Confluence documents to enhance DevOps response efficiency.
● Conducted capacity planning, managed SLAs, and built infrastructure whilst establishing an SRE/Infrastructure team.
● Evaluated the Hardware Acceptance List (HAL) for new hosts to build new or existing replicas.
● Managed SLAs and defined and designed tools for production servers and network support.
● Engineered modern cloud solutions to enhance the performance and resilience of Expedia’s cloud infrastructure.

Continuous Integration · Site Reliability Engineering · Agile Methodologies · Project Management · Cloud Security · Cross-team Collaboration · +26

Times Internet

Technology Manager

Jul 2009 – Jun 2015 · 5 yrs 11 mos · Noida, Uttar Pradesh, India

Oversaw end-to-end availability and performance of features. Exhibited expertise in scripting languages and configured Chef for application server management. Extended provisioning capabilities to the development team based on approved quotas. Oversaw data management, disaster recovery, and business continuity planning projects. Planned for a significant increase in international traffic, collaborating with multiple CDN providers.
● Designed and developed architecture using Chef and in-house monitoring tools to build scalable, on-demand infrastructure on VMware, reducing provisioning times for virtual machines from 4-6 hours to minutes.
● Centralized and consolidated multiple version control systems to Bitbucket (Git), streamlining agile development and facilitating continuous integration for faster time-to-market.
● Developed a strategy for a hybrid cloud infrastructure, combining in-house VMware-based private cloud and AWS public cloud to provide flexible infrastructure with reduced TCO.
● Directed the migration from RHEL 4/5 to CentOS 6 across 30 product lines, managing timelines, risks, and 150 stakeholders. Completed the migration in 4 months, saving US $450K.
● Spearheaded the migration of production infrastructure from physical data centers to AWS, participating in technical design discussions and reviewing technical documents.
● Developed strategies and plans to design next-generation services, ensuring media properties remained at the forefront of digital content delivery.
● Operated lights-out management for India's top digital sites, including indiatimes.com, with 90 million unique visitors per month.