Arnab Roy

DevOps Engineer

Mercer Island, Washington, United States · 19 yrs 11 mos experience

Highly Stable · AI Enabled

Key Highlights

Expert in building GPU-heavy supercomputing environments.
Proven track record in AI and HPC optimization.
Strong leadership in site reliability engineering.

Stackforce AI infers this person is a Cloud Computing and AI Infrastructure expert with extensive experience in performance optimization.

Skills

Core Skills

Artificial Intelligence (AI) · High Performance Computing (HPC) · Technology Leadership · Site Reliability Engineering · Infrastructure · Machine Learning · Cloud Computing · Capacity Planning · Performance Engineering · Embedded Systems · Software Development

Other Skills

Graphics Processing Unit · Linux · Slurm Workload Manager · Lustre · InfiniBand · Nvidia Base Command Manager · RoCE · NCCL · Amazon Web Services (AWS) · Architecture · Python (Programming Language) · React.js · MySQL · PHP · Distributed Systems

About

I build and operate cloud-based supercomputing environments: GPU-heavy, high-availability compute clusters orchestrated with Slurm and K8s. I specialize in distributed AI training systems, scaling, reliability, and performance optimization. My background spans HPC, cloud infra, and technical leadership.

Experience

19 yrs 11 mos

Total Experience

3 yrs 9 mos

Average Tenure

1 yr

Current Experience

Nvidia

Senior AI-HPC Cluster Engineer

May 2025 – Present · 1 yr · Redmond, WA · Hybrid

NVIDIA DGX Cloud:
Leading the design, implementation, operationalization, and optimization of large-scale GPU superclusters that support AI, HPC, and supercomputing-scale models (e.g., Nvidia Nemotron: https://github.com/NVIDIA-NeMo/Nemotron), with the help of Nvidia Lepton (https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/).

Artificial Intelligence (AI) · High Performance Computing (HPC) · Graphics Processing Unit · Linux · Slurm Workload Manager · Lustre · +4

JPMorganChase

Senior Lead Site Reliability Engineer

Apr 2023 – May 2025 · 2 yrs 1 mo · Seattle, Washington, United States · On-site

Consumer & Community Banking:
Led the technical leadership team of the Connected Commerce SRE division.
This division manages Chase's entire credit card franchise, Zelle transfers, Travel, Dining, Shopping, Offers, and Lending programs, as well as all Consumer and Small Business payments.

Technology Leadership · Machine Learning · Infrastructure · Amazon Web Services (AWS) · Site Reliability Engineering

Oracle

2 roles

Principal Member of the Technical Staff

Aug 2016 – Jun 2018 · 1 yr 10 mos · Bellevue, Washington, United States · On-site

Oracle Analytics Cloud:
Tech lead responsible for improving cloud scalability, performance, and reliability.

Technology Leadership · Java (Programming Language) · Oracle · Performance Engineering · Capacity Planning · Distributed Systems · +4

Senior Member of Technical Staff

Jun 2014 – Jul 2016 · 2 yrs 1 mo · Bellevue, Washington, United States · On-site

Oracle Business Intelligence Cloud:
Developed a cloud upgrade verification tool (US Patent 10228932) to analyze and compare cloud deployments and migrations.
Developed infra tools to improve the scalability of Oracle's Business Intelligence Cloud.

Technology Leadership · Java (Programming Language) · Oracle · Performance Engineering · Capacity Planning · Distributed Systems · +4

Microsoft

2 roles

Software Design Engineer 2

Sep 2011 – May 2014 · 2 yrs 8 mos · Redmond, Washington, United States · On-site

Microsoft OneDrive:
Developed full-stack products that helped scale OneDrive from 5 million to 250+ million users.
Developed a cross-region cloud deployment tool for OneDrive's multi-region expansion.

C#.net (Programming Language) · SQL Server · PowerShell · Performance Engineering · Capacity Planning · Distributed Systems · +4

Software Design Engineer in Test

Jun 2008 – Aug 2011 · 3 yrs 2 mos · Redmond, Washington, United States · On-site

Microsoft Services:
Developed test automation for the data access layer of Hotmail (now Outlook) and SkyDrive (now OneDrive).
Developed test automation for the async queue framework of OneDrive.
Microsoft Server:
Scaled and tested Microsoft's mobile device platform to 300K devices in a single corporate domain.

C#.net (Programming Language) · SQL Server · PowerShell · Performance Engineering · Capacity Planning · Distributed Systems · +4

Synergy Microwave Corporation

Software Design Engineer

Oct 2005 – May 2008 · 2 yrs 7 mos · Paterson, New Jersey, United States · On-site

Open Source Development:
Developed firmware for ARM microprocessors and the ATMEL microcontroller series.
Developed open-source, cross-platform products and tools for telecom companies.

J2EE Web Services · C (Programming Language) · C++ · Microcontrollers · Embedded Systems · Software Development

Indian Statistical Institute, Kolkata

Software Design Engineer

Jun 2003 – Jul 2004 · 1 yr 1 mo · Kolkata, West Bengal, India · On-site

Machine Learning:
Developed an ML model for a project on protein folding prediction sponsored by the Defence Research and Development Organization (DRDO), similar to: https://www.technologyreview.com/2020/11/30/1012712/deepmind-protein-folding-ai-solved-biology-science-drugs-disease/

Machine Learning · C (Programming Language) · C++ · Artificial Intelligence (AI) · Neural Networks