Manish Gupta

CEO

San Jose, California, United States · 17 yrs 1 mo experience

Key Highlights

  • Ph.D. in Computer Science with extensive GPU expertise.
  • Gold Medalist from IIT Roorkee in Electrical Engineering.
  • Significant contributions to NVIDIA's CUTLASS project.
Stackforce AI infers this person is a highly skilled GPU and Compiler Engineer in the AI/ML industry.

Skills

Core Skills

GPU · Compilers · Parallel Programming · Embedded Systems

Other Skills

Algorithms · Assembly implementation · Big Data analytics · C++AMP · CUTLASS · Computer Architecture · DSP Firmware · Debugging · DeepSeek's V3 and R1 models · Digital Signal Processing · Exascale computing · FP8 datatype · Firmware · GEMM optimizations · GEMMs

About

Ph.D. Computer Science • UCSD Powell Fellow • IIT Roorkee EE Gold Medalist • GPUs • Parallel Programming • Compilers • Software

Skills:

  • Parallel Programming: C++AMP, OpenCL, CUDA (in decreasing order of time spent with each language)
  • Compiler Frameworks: MLIR, LLVM, Clang, GCC
  • Programming Languages: C, C++, Python

Experience

Total Experience: 17 yrs 1 mo
Average Tenure: 2 yrs 9 mos
Current Experience: 1 yr 3 mos

Magic

Member of Technical Staff

Jan 2025 - Present · 1 yr 3 mos · San Francisco Bay Area · Hybrid

  • Working on scaling test-time compute, long context, and reinforcement learning. We are a small group of engineers and researchers working to solve a short list of fundamental problems. If this sounds interesting, we would love to hear from you.

Meta

Software Engineer

Jun 2024 - Dec 2024 · 6 mos · Menlo Park, California, United States

  • Improved performance for Llama 70B and 405B models on NVIDIA H100.
  • Developed scaling techniques to make the FP8 datatype work efficiently with acceptable numerical accuracy.
  • Implemented FP8 GEMMs with blockwise scaling on NVIDIA GPUs that can be extended to groupwise scaling.
  • The FP8 blockwise-scaling GEMM work is also used in DeepSeek's V3 and R1 models.
  • I left Meta on Dec 27, 2024 to explore the space of test-time compute, long context, and reinforcement learning.
NVIDIA H100 · FP8 datatype · GEMMs · DeepSeek's V3 and R1 models · GPU · Compilers

Google

2 roles

Software Engineer

Sep 2023 - May 2024 · 8 mos · Mountain View, California, United States

  • Improving GPU performance and programmability for LLMs and more.
  • Planning, scoping, and evaluating the codegen efforts for the NVIDIA Hopper architecture.
GPU performance · NVIDIA Hopper architecture · GPU · Compilers

Software Engineer

Jun 2022 - Sep 2023 · 1 yr 3 mos · Mountain View, California, United States

  • MLIR/LLVM codegen in the OpenXLA compiler.
  • Codegen of GEMMs for NVIDIA A100 Tensor Cores using the OpenXLA/IREE MLIR compiler to match handwritten (CUTLASS) and library (cuBLAS) performance.
  • Improved half-precision and single-precision NVIDIA A100 codegen performance from 144 to 238 TFLOPs and 77 to 118 TFLOPs, respectively.
  • Also implemented support for batch matmul, split-k, bfloat16, and mixed-precision datatypes.
MLIR · LLVM · OpenXLA · NVIDIA A100 · Compilers · Parallel Programming

NVIDIA

Software Engineer

Dec 2017 - May 2022 · 4 yrs 5 mos · Santa Clara, California, United States

  • One of the early engineers working on the CUTLASS project from 2017 to 2022, when the team operated with a headcount of 3-4 engineers.
  • During this time, accelerated ML on four generations of NVIDIA GPU architectures (Volta [V100], Turing, Ampere [A100], and Hopper [H100]).
  • Notable highlights:
  • Many GEMM optimizations, including split-k (serial and parallel), complex Gaussian GEMMs, and improvements to TensorFloat32 (TF32) GEMMs.
  • Designed and implemented Implicit GEMM convolution supporting Fprop, Dgrad, and Wgrad.
  • Improved backward strided data gradient performance by 4x vs. cuDNN 8.2.
  • Presented the work at GTC 2021 and GTC 2022 (slides and talk in the links below).
CUTLASS · TensorFloat32 · GEMM optimizations · Compilers · Parallel Programming

AMD

Software Engineer

Jan 2016 - Sep 2016 · 8 mos · Austin, Texas Metropolitan Area

  • Compiler passes using LLVM for GPU kernels written in C++AMP
  • Reliability vs. performance trade-offs study for memories
  • Worked on exascale computing project as a part of resiliency and reliability team

UC San Diego

2 roles

Teaching Assistant

Mar 2014 - May 2015 · 1 yr 2 mos · San Diego, California, United States

  • Basic Data Structure & Object Oriented Design, Teaching Assistant, Fall 2015
  • Software for Embedded Systems, Teaching Assistant, Spring 2014

Graduate Researcher

Aug 2010 - Sep 2017 · 7 yrs 1 mo · San Diego, California, United States

  • Static taint analysis for CUDA kernels using LLVM
  • Software-based recovery mechanisms for code executing on unreliable hardware
  • Compiler infrastructure to analyze and generate reliable code
  • Big Data analytics and improving Hadoop performance via reuse
Static taint analysis · Big Data analytics · Hadoop performance · Compilers · Parallel Programming

Qualcomm

Firmware Engineer

Aug 2008 - Jul 2010 · 1 yr 11 mos

  • DSP Firmware Team: Assembly implementation of audio codecs.
  • Coded and integrated features such as Dolby Digital decoder, echo canceller and noise suppressor, automatic gain control, and IIR filters on Qualcomm ADSP.
  • Acquired hands-on knowledge of processor architecture, instruction decoding, memory management, OS fundamentals, assembly and C language, RTOS understanding, design and development of firmware, system simulation, and software testing using Tcl programming and JTAGs.
  • Worked on audio codecs as a part of DSP firmware team
LLVM · C++AMP · Exascale computing · Compilers · Parallel Programming

Air India Limited

Intern

May 2007 - Jun 2007 · 1 mo

  • Studied electrical, aeronautical, and hydraulic systems of A-320 aircraft.
DSP Firmware · Assembly implementation · Embedded Systems

Engineers India Limited

Intern

May 2006 - Jun 2006 · 1 mo · Greater Delhi Area

  • The project involved installation of new transformers, motors, generators, and other electrical equipment. I was responsible for sizing transformers for a given load, considering the impact of the highest-rated motor. The final system was modeled in AutoCAD.

Education

UC San Diego

PhD — Computer Science

Jan 2010 - Jan 2017

UC San Diego

Master's degree — Computer Science

Jan 2010 - Jan 2012

Indian Institute of Technology, Roorkee

B.Tech — Electrical Engineering (Gold Medalist)

Jan 2004 - Jan 2008
