Kali Uday B.

CTO

Bengaluru, Karnataka, India · 10 yrs 11 mos experience

AI ML Practitioner · AI Enabled

Key Highlights

Over a decade of experience in HPC and AI.
Expert in performance tuning and kernel optimizations.
Led teams in developing cutting-edge AI applications.

Stackforce AI infers this person is a specialist in HPC and AI, focusing on performance optimization and FPGA acceleration.

Skills

Core Skills

AI/ML Application Development, Performance Benchmarking, FPGA Acceleration, Compression Algorithms, Data Compression, AI/ML Acceleration, CNN Inference, Video Enhancement, GPU Acceleration

Other Skills

AIPC, Algorithm Design, Apple Metal, Artificial Intelligence (AI), Attention Mechanisms, BERT (Language Model), Bioinformatics, C, C++, C/C++, CUDA, Code Optimization, DPC++, Data Structures, Digital Image Processing

About

An experienced High-Performance Computing (HPC) and Artificial Intelligence R&D engineer with a career of over a decade at reputed semiconductor and HPC/AI organizations. Specialized in application acceleration, performance tuning and kernel optimization targeting different HPC architectures: multi-core, many-core (GPUs) and spatial (FPGAs).

Key Technical Skills:
- C/C++, Python, Julia, Rust and scripting
- Parallel programming (MPI, OpenMP, CUDA, SYCL, OpenCL and Vitis High-Level Synthesis)
- Expert in building RAG-based generative AI systems on-premise
- ML frameworks: PyTorch, TensorFlow, JAX, OpenVINO, DirectML, DeepSpeed
- AI/ML compiler development: Triton
- Artificial intelligence, data compression, genomics and networking
- Application tuning, performance and power analysis on CPU, GPU and FPGA architectures
- HPC architecture design and implementation
- Large-scale cluster and grid computing
- Algorithm optimization
- Exposure to various analyzers: Intel (VTune, APS, GTPin and Advisor), Nvidia (Nsight Compute, NSys, NvProf), AMD (Vitis, CodeXL and Vivado)
- Experienced in development on Linux and Windows

Key Non-Technical Skills:
- Excellent presentation skills
- Mentored engineers at various levels
- Led teams of 3 to 7 members
- Played a key role in establishing teams on multiple occasions
- Demonstrated a customer-first approach

Experience

Applied Materials

Deputy Director Software (HPC/AI)

Aug 2025 – Present · 7 mos · Bengaluru, Karnataka, India · On-site

Intel Corporation

Staff Software Engineer (HPC/AI)

Sep 2022 – Aug 2025 · 2 yrs 11 mos · Hyderabad, Telangana, India · On-site

AI/ML Application Development on Linux/Windows
Enabled the OpenAI Triton backend for Intel GPUs (Triton AI/ML compiler development)
Developed AIPC-based GPU/NPU AI/ML applications on Linux/Windows
Led a team of 6+ engineers on OpenVINO and AI workload development targeting GPUs/NPUs
Enabled oneAPI for the Julia programming language
Enabled oneAPI for the Rust programming language
Enabled ChipStar benchmarks (HeCBench) and supported customer enablement
Key member on a customer project for the Intel Analyzers team targeting large-scale HPC clusters
Benchmarked and tuned the performance of ML applications targeting Intel GPUs (SYCL/DPC++)
Improved the quality of the Intel GPU software stack (Level Zero / oneAPI)
Received multiple recognitions (3 Division Level Recognition Awards; Executive level: CustomerFirst, oneIntel and ResultsDriven) for contributions to product improvement and pathfinding work
Mentored junior engineers on HPC & AI
Hands On: C/C++, SYCL, CUDA, OpenCL, Level Zero, JuliaGPU, Apple Metal and Rust
AI/ML Frameworks: Triton, PyTorch, OpenVINO, TensorFlow, JAX, DirectML

C/C++, SYCL, CUDA, OpenCL, Level Zero, JuliaGPU

Xilinx

2 roles

Senior Software Engineer - II

Promoted

May 2019 – Sep 2022 · 3 yrs 4 mos

~ Architected and implemented a highly optimized FPGA-accelerated compression library (OpenCL/HLS) covering well-known algorithms such as GZip, LZ4, ZLIB and Snappy.
~ Main architect of the Libz compression library for the Cloudera-Hadoop acceleration (Map/Reduce) framework and the entire software stack.
~ Migrated various genomic pipeline algorithms from Xilinx SDAccel to Vitis targeting Alveo U50.
~ Owned the network traffic generator simulating the Alveo network-enabled FPGA (X3) architecture, used for software/hardware emulation.
~ Enabled various key customers in data compression acceleration targeting various discrete FPGA platforms (Alveo SmartSSD, U50 and U250).
~ Mentored various engineers and interns within and outside the team on discrete FPGA acceleration using the OpenCL/HLS software stack on Vitis.
~ Led a project to publish the FPGA-accelerated GZip app for on-premise, Docker and AWS deployment for quick customer adoption.
~ Received customer appreciation and a worldwide award nomination for SmartSSD acceleration of an LZ4 application delivered on a short timeline.
~ Successfully conducted the GO-PYNQ contest at IIT Kharagpur (Kshitiz).
Hands On: C/C++, Python, Scripting, OpenCL and HLS. Experience with the AMD Vitis tool.

C/C++, Python, OpenCL, HLS, FPGA Acceleration, Compression Algorithms

Senior Software Engineer - I

Jan 2017 – May 2019 · 2 yrs 4 mos

~ Worked on Xilinx heterogeneous and embedded FPGAs (HLS & OpenCL).
~ Developed high-quality SDx (SDAccel & SDSoC) on-boarding applications that showcase new features and best practices for end users of the SDx tools.
~ Accelerated data compression algorithms using SDAccel OpenCL targeting Xilinx PCIe FPGA cards.
~ Provided solutions to new users on the Xilinx SDAccel & SDSoC forums.
~ Built data-center-ready FPGA-accelerated data compression applications such as GZip and LZ4, targeting FPGA clouds: AWS F1, Alibaba and Nimbix Cloud.
Hands On: C/C++, Shell scripting and Xilinx SDx [SDAccel & SDSoC] (C/C++, HLS and OpenCL)

C/C++, OpenCL, FPGA Acceleration, Data Compression

MulticoreWare

2 roles

Senior Software Engineer

Promoted

May 2015 – Jan 2017 · 1 yr 8 mos

~ Developed an OpenCL FPGA acceleration of a CNN inference AI workload.
~ Worked on-site in China on Google VP9 decoder acceleration using RenderScript.
~ Implemented a demo presenting vehicle detection using FPGAs with the Xilinx SDAccel/OpenCL environment.
~ Worked on a drone-based face classifier using CNNs on Jetson TK1 & K40 (CUDA).
~ Published a research article and presented a poster at the High Performance Computing conference (HiPC 2015).
~ Worked on acceleration of genome sequence alignment using SDAccel (HLS & OpenCL flows).
~ Worked on optimization of the AlexNet model using ImageNet data for the Virtex-7 FPGA using SDAccel.
Hands On: C/C++, OpenCL

C/C++, OpenCL, AI/ML Acceleration, CNN Inference

Software Engineer

May 2013 – May 2015 · 2 yrs

~ Implemented a video enhancement application targeting FPGA.
~ Worked on acceleration of the VP9 decoder using RenderScript & OpenCL for mobile GPUs.
~ Worked on memory optimizations of a few blocks in the VP9 decoder (inter, intra and loopfilter).
~ Worked on power optimizations of the VP9 mobile decoder.
~ Led a team of 6 members working on ray tracing application acceleration (Sep 2014 - Mar 2015).
~ Designed and implemented multiple GPU kernels and achieved the best performance for Cycles render.
~ Implemented concurrent OpenCL kernels that utilize multiple hardware devices (CPU & GPU).
~ Published a research article on medical imaging at an IEEE-sponsored conference along with an intern.
Hands On: C/C++, OpenCL, CUDA, Google RenderScript and Java