Arnab Roy

DevOps Engineer

Mercer Island, Washington, United States · 19 yrs 11 mos experience
Highly Stable · AI Enabled

Key Highlights

  • Expert in building GPU-heavy supercomputing environments.
  • Proven track record in AI and HPC optimization.
  • Strong leadership in site reliability engineering.
Stackforce AI infers this person is a Cloud Computing and AI Infrastructure expert with extensive experience in performance optimization.

Skills

Core Skills

Artificial Intelligence (AI) · High Performance Computing (HPC) · Technology Leadership · Site Reliability Engineering · Infrastructure · Machine Learning · Cloud Computing · Capacity Planning · Performance Engineering · Embedded Systems · Software Development

Other Skills

Graphics Processing Unit · Linux · Slurm Workload Manager · Lustre · InfiniBand · Nvidia Base Command Manager · RoCE · NCCL · Amazon Web Services (AWS) · Architecture · Python (Programming Language) · React.js · MySQL · PHP · Distributed Systems

About

I build and operate cloud-based supercomputing environments: GPU-heavy, Slurm- and K8s-orchestrated, high-availability compute clusters. I specialize in distributed AI training systems, scaling, reliability, and performance optimization. My background spans HPC, cloud infrastructure, and technical leadership.

Experience

Total Experience: 19 yrs 11 mos
Average Tenure: 3 yrs 9 mos
Current Experience: 1 yr

Nvidia

Senior AI-HPC Cluster Engineer

May 2025 - Present · 1 yr · Redmond, WA · Hybrid

  • NVIDIA DGX Cloud:
  • Leading the design, implementation, operationalization, and optimization of large-scale GPU superclusters that support AI, HPC, and supercomputing models (e.g., Nvidia Nemotron: https://github.com/NVIDIA-NeMo/Nemotron) with the help of Nvidia Lepton (https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/).
Artificial Intelligence (AI) · High Performance Computing (HPC) · Graphics Processing Unit · Linux · Slurm Workload Manager · Lustre · +4

JPMorganChase

Senior Lead Site Reliability Engineer

Apr 2023 - May 2025 · 2 yrs 1 mo · Seattle, Washington, United States · On-site

  • Consumer & Community Banking:
  • Leading the technical leadership team of the Connected Commerce SRE division.
  • This division manages all of Chase's credit card franchise, Zelle transfers, Travel, Dining, Shopping, Offers, and Lending programs, as well as all Consumer and Small Business payments.
Technology Leadership · Machine Learning · Infrastructure · Amazon Web Services (AWS) · Site Reliability Engineering

Meta

2 roles

Staff Performance and Capacity Engineer

Aug 2021 - Jan 2023 · 1 yr 5 mos

  • FB Infra:
  • Cloud infrastructure tech lead solving high-impact, optimization-related problems.
  • FB Product:
  • Improving reliability, performance, scalability, capacity, and availability of Facebook's search engine (Unicorn).
Technology Leadership · Machine Learning · Infrastructure · Architecture · Python (Programming Language) · React.js · +6

Senior Performance and Capacity Engineer

Jun 2018 - Jul 2021 · 3 yrs 1 mo

  • FB Infra:
  • Machine learning tech lead forecasting Facebook's compute growth.
  • FB Product:
  • Improving performance, scalability, capacity, and availability of Facebook's BLOB storage (Everstore).
Technology Leadership · Python (Programming Language) · React.js · MySQL · PHP · Performance Engineering · +6

Oracle

2 roles

Principal Member of the Technical Staff

Aug 2016 - Jun 2018 · 1 yr 10 mos · Bellevue, Washington, United States · On-site

  • Oracle Analytics Cloud:
  • Tech lead improving cloud scalability, performance, and reliability.
Technology Leadership · Java (Programming Language) · Oracle · Performance Engineering · Capacity Planning · Distributed Systems · +4

Senior Member of Technical Staff

Jun 2014 - Jul 2016 · 2 yrs 1 mo · Bellevue, Washington, United States · On-site

  • Oracle Business Intelligence Cloud:
  • Developed a cloud upgrade verification tool (US Patent 10228932) to analyze and compare cloud deployments and migrations.
  • Developed infrastructure tools to improve the scalability of Oracle's Business Intelligence Cloud.
Technology Leadership · Java (Programming Language) · Oracle · Performance Engineering · Capacity Planning · Distributed Systems · +4

Microsoft

2 roles

Software Design Engineer 2

Sep 2011 - May 2014 · 2 yrs 8 mos · Redmond, Washington, United States · On-site

  • Microsoft OneDrive:
  • Developed full-stack products that helped scale OneDrive from 5 million to 250+ million users.
  • Developed a cross-region cloud deployment tool for OneDrive's multi-region expansion.
C#.net (Programming Language) · SQL Server · PowerShell · Performance Engineering · Capacity Planning · Distributed Systems · +4

Software Design Engineer in Test

Jun 2008 - Aug 2011 · 3 yrs 2 mos · Redmond, Washington, United States · On-site

  • Microsoft Services:
  • Developed test automation for the data access layer of Hotmail (now Outlook) and SkyDrive (now OneDrive).
  • Developed test automation for the async queue framework for OneDrive.
  • Microsoft Server:
  • Scaled and tested Microsoft's mobile device platform to 300K devices in a single corporate domain.
C#.net (Programming Language) · SQL Server · PowerShell · Performance Engineering · Capacity Planning · Distributed Systems · +4

Synergy Microwave Corporation

Software Design Engineer

Oct 2005 - May 2008 · 2 yrs 7 mos · Paterson, New Jersey, United States · On-site

  • Open Source Development:
  • Developed firmware for ARM microprocessors and the Atmel microcontroller series.
  • Developed open-source, cross-platform products and tools for telecom companies.
J2EE Web Services · C (Programming Language) · C++ · Microcontrollers · Embedded Systems · Software Development

Indian Statistical Institute, Kolkata

Software Design Engineer

Jun 2003 - Jul 2004 · 1 yr 1 mo · Kolkata, West Bengal, India · On-site

  • Machine Learning:
  • Developed an ML model for a project on protein folding prediction sponsored by the Defence Research and Development Organization (DRDO), similar to: https://www.technologyreview.com/2020/11/30/1012712/deepmind-protein-folding-ai-solved-biology-science-drugs-disease/
Machine Learning · C (Programming Language) · C++ · Artificial Intelligence (AI) · Neural Networks

Education

Massachusetts Institute of Technology

Advanced Educational Program — Systems Architecture

New Jersey Institute of Technology

MS (Master of Science) — Computer Science

Founder Institute

Entrepreneurship — Startup

IEEE

IEEE Continuing Education — Computer Science

Vidyasagar University

BE (Bachelor of Engineering) — Computer Science and Engineering

St. Xavier's College (Autonomous), Kolkata

High School — Science
