Nimit Nigania

Lead ML Engineer

San Francisco, California, United States · 14 yrs 9 mos experience

Highly Stable

Key Highlights

Expert in GPU optimization and performance engineering.
Proven track record in deep learning model deployment.
Significant contributions to MLPerf submissions.

Stackforce AI infers this person is a Machine Learning Engineer with a strong focus on GPU optimization and high-performance computing.

Contact

nimitnigania@gmail.com LinkedIn

Skills

Core Skills

Machine Learning, GPU Optimization, Deep Learning, Data Engineering, GPU Software Engineering, Research, Internship, Software Engineering

Other Skills

Analog Circuit Design, Architectural simulators, C, C++, CUDA, Cadence Virtuoso, Computer Architecture, Computer Science, Debugging, Digital Circuit Design, Electrical Engineering, Embedded Systems, FPGA, Feature engineering

About

I am a Machine Learning Engineer specializing in building and accelerating deep learning models for high-performance production environments. My passion lies in bridging the gap between cutting-edge model development and the underlying hardware, transforming computationally expensive models into efficient, low-latency applications.

My experience spans the full ML lifecycle, from designing and training novel architectures in PyTorch and JAX to deploying them at scale. What sets my work apart is a deep expertise in performance engineering and GPU optimization. I have hands-on experience profiling with tools like Nsight, writing custom CUDA kernels for critical performance bottlenecks, and leveraging frameworks like TensorRT and Triton Inference Server to slash inference latency. I'm adept at techniques such as model quantization (INT8/FP8), kernel fusion, and optimizing GPU memory bandwidth to maximize throughput.

Experience

Total Experience: 14 yrs 9 mos

Average Tenure: 2 yrs 10 mos

Current Experience: 10 mos

Snap Inc.

ML Engineering Lead

Jul 2025 – Present · 10 mos · Palo Alto, CA · On-site

Leading the ML GPU optimization effort.

GPU optimization, Machine Learning

Google

Machine Learning Engineer

Feb 2019 – Jul 2025 · 6 yrs 5 mos · Mountain View, California

2019 - 2022: ML at Google Brain. Worked on TensorFlow and PyTorch performance on GPUs. Improved ML models such as NCF, BERT, and ResNet-50. Official BERT GPU submission to MLPerf / MLCommons. Also wrote CUDA kernels for optimizations.
2022 - 2025: Ads Machine Learning (YouTube). Improved model quality to drive conversions and revenue for YouTube Ads. Used new data sources, engineered features, leveraged LLMs, and architected recommendation models to represent users more accurately.

TensorFlow, PyTorch, CUDA, MLPerf, Feature engineering, LLMs

Apple

GPU Software Engineer

Feb 2014 – Feb 2019 · 5 yrs · Cupertino

GPU modeling.

GPU modeling, GPU Software Engineering

Intel Corporation

2 roles

Graduate Intern

May 2013 – Aug 2013 · 3 mos · Portland, Oregon Area

Worked with the Xeon Phi group. Added a key feature to the performance model to better utilize the memory system.

Performance modeling, Memory system utilization, Internship

Software Engineer

Jul 2011 – Aug 2012 · 1 yr 1 mo · Bengaluru Area, India

Worked with the Many Integrated Core (MIC) group on performance characterization and debugging of future many-core processors. Gained more in-depth knowledge of computer architecture, simulators, and performance analysis of high-performance computers.

Performance characterization, Debugging, Software Engineering

Georgia Institute of Technology

Graduate Research Assistant

Aug 2012 – Jan 2014 · 1 yr 5 mos · Greater Atlanta Area

Worked on novel techniques to estimate the performance and power of GPUs and CPUs using analytical models and architectural simulators.

Performance estimation, Architectural simulators, Research

University of Heidelberg

Research Intern

May 2010 – Jun 2010 · 1 mo · Heidelberg Area, Germany

Worked on embedded serializers to be used with an FPGA for the ATLAS experiment at CERN.

Embedded systems, FPGA, Research

Georgia Institute of Technology

Research Intern

May 2009 – Jul 2009 · 2 mos · Atlanta, GA

Worked with the high-performance computing (HPC) group and the computer architecture group on a joint project studying the combined benefits of using software and hardware prefetching together.

High Performance Computing, Computer Architecture, Research

Xilinx

Summer Intern

May 2008 – Jul 2008 · 2 mos · Hyderabad Area, India

Worked with the Advanced Product Division (APD) memory team to develop a testbench for the DDR2 memory controller by integrating it with a MicroBlaze processor.

Memory controller, Testbench integration, Internship