Nitin Singh

AI Researcher

Bengaluru, Karnataka, India · 6 yrs 3 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Expert in GPU programming and deep learning optimization.
  • Led development of real-time BCI data processing SDK.
  • Proven track record in enhancing AI model performance.
Stackforce AI infers this person is a specialized AI Infrastructure Engineer with a focus on deep learning and BCI applications.

Skills

Core Skills

GPU Programming · Deep Learning · C++ · Machine Learning · Data Science

Other Skills

PyTorch · Distributed Training · Python 3 · Object Oriented Design · Signal Processing · TensorFlow · CUDA · Python · Data Visualization · Electrical Engineering · Interpersonal Skills · GPU · Large Language Models (LLM) · AI workloads · Distributed Inference

About

AI Systems Performance Engineer specializing in deep learning compilers and accelerator efficiency. I build infrastructure that enables large-scale models to train and execute efficiently on modern hardware. At the PyTorch framework level, I focus on graph optimization, dynamic shape handling, and operator clustering to improve performance of LLM workloads. I also work across distributed training and inference, debugging and extending collective communication backends to strengthen FSDP correctness, scalability, and multi-node performance. At the kernel layer, I design CUTLASS-SYCL compute primitives for Intel GPUs, implementing low-precision and quantized kernels optimized for memory bandwidth, compute utilization, and architectural characteristics. Earlier in my career, I built production C++ systems and real-time ML pipelines for Brain-Computer Interface applications, developing strong foundations in signal processing, linear algebra, and performance-critical systems engineering.

Experience

Total Experience: 6 yrs 3 mos
Average Tenure: 2 yrs 1 mo
Current Experience: 2 yrs 6 mos

Intel Corporation

AI Software Solutions Engineer

Nov 2023 – Present · 2 yrs 6 mos · Bengaluru, Karnataka, India · On-site

  • Building AI software infrastructure at Intel across the full stack — from GPU compute kernels and hardware-level tuning to PyTorch compiler passes and distributed training.
  • On the kernel side, I build CUTLASS/SYCL-based GEMM infrastructure — covering grouped operations, low-precision quantized workloads (FP8, MXFP4), and fused post-GEMM operations — with architecture-specific tuning for Intel Xe2/Xe4 GPUs to maximize compute throughput and memory bandwidth.
  • On the framework side, I work on PyTorch graph optimization, including symbolic shape handling, graph clustering, and dynamic shape processing, which reduced host-side graph compilation overhead by 26% for LLM inference workloads such as Llama.
  • I also work on distributed training infrastructure, debugging and extending collective communication backends (HCCL/XCCL) to improve FSDP training & inference correctness and scalability across multi-GPU setups.
C++ · PyTorch · GPU Programming · Deep Learning

Nexstem

2 roles

C++ Software Engineer, R&D Lead

Promoted

Oct 2021 – Oct 2023 · 2 yrs · On-site

  • Led the development of a cross-platform SDK designed for real-time, high-sample-rate multivariate EEG streaming. The SDK facilitates the creation of concurrent data processing pipelines for Brain-Computer Interface (BCI) applications and was built initially for x86 Linux, with later adaptations for aarch64 Linux. The project involved the design and implementation of signal filtering, processing, visualization, and machine learning algorithms in C++, as well as a sensor-less motor-position control algorithm that enhanced the signal-to-noise ratio for more accurate EEG data.
Python 3 · Object Oriented Design · C++ · Machine Learning

Machine Learning Specialist

Sep 2020 – Sep 2021 · 1 yr · On-site

  • Developed a Python toolkit for user authentication and for real-time data filtering, processing, and visualization across multiple plot types. Also created interactive games for real-time labeled data collection in BCI experiments, and designed the data filtering and cleaning, feature extraction, training, and inference pipelines for the resulting models. These pipelines and models supported a variety of BCI paradigms, including motor imagery classification, emotion classification, visually evoked stimulus classification, and lie detection, with models achieving per-subject median accuracies exceeding 85%.
TensorFlow · CUDA · Machine Learning · Data Science

Larsen and Toubro Construction

Graduate Engineer Trainee

Jun 2016 – Mar 2017 · 9 mos · Ghaziabad, Uttar Pradesh, India

  • Worked in the Planning Department on a Government of India feeder segregation project. Responsibilities included supply chain management, data analysis, and project monitoring. Developed data projection tools in Python to optimize supply chain flows.
Electrical Engineering · Interpersonal Skills

Education

National Institute of Technology Goa

M. Tech — Power Electronics & Power Systems

Jan 2018 – Jan 2020

National Institute of Technology Goa

B. Tech — Electrical and Electronics Engineering

Jan 2012 – Jan 2016
