Jaspreet S.

DevOps Engineer

North York, Ontario, Canada · 10 yrs 1 mo experience

AI ML Practitioner · AI Enabled

Key Highlights

Expert in architecting large-scale HPC infrastructures.
Proficient in optimizing AI workloads with GPU technologies.
Strong background in CI/CD integration for infrastructure.

Stackforce AI infers this person is a High Performance Computing and Infrastructure specialist with a focus on AI and cloud technologies.

Skills

Core Skills

High Performance Computing (HPC) · Infrastructure

Other Skills

GPU · HPC · AI · Ansible · LSF · Slurm · GitLab Runner · Jenkins · Prometheus · Weka · Pure Storage · NVIDIA V100 · H100 · AI/ML · deep learning

About

Architect, install, and manage large-scale enterprise environments and optimize HPC workloads.

Experience

Total Experience: 10 yrs 1 mo

Average Tenure: 5 yrs

Current Experience

Tenstorrent

Sr Engineer, Developer Infrastructure

Apr 2023 – Present · 3 yrs 2 mos · Toronto, Ontario, Canada · Hybrid

Architect and operate large-scale GPU and HPC infrastructure supporting AI development, validation, and CI/CD workflows in datacenter-style environments.
Administer GPU/HPC clusters using LSF and Slurm, including system provisioning, scheduler tuning, capacity expansion, and performance optimization.
Lead automated bare-metal and cluster deployments using Ansible, enabling rapid and consistent system bring-up at scale.
Integrate infrastructure with CI/CD pipelines (GitLab Runner, Jenkins) to support automated testing and validation workflows.
Deploy and maintain monitoring and observability platforms using Prometheus and Node Exporter for real-time system visibility.
Design and operate high-throughput, low-latency storage platforms (Weka, Pure Storage) optimized for GPU-intensive validation and AI workloads.
Collaborate cross-functionally with engineering, QA, DevOps, and operations teams to deliver unified infrastructure platforms.

GPU · HPC · AI · Ansible · LSF · Slurm · +7 more

Qualcomm

2 roles

IT Engineer, Staff

Promoted

Dec 2021 – Feb 2023 · 1 yr 2 mos

Accountable for availability, latency, performance, efficiency, change management, monitoring, and emergency response for over 6 global grids with more than 1.6 million cores.
Managed and supported a large-scale LSF cluster equipped with NVIDIA V100 and H100 GPUs, optimizing performance for AI/ML, deep learning, and HPC workloads.
Administer global computing resource management systems and services such as LSF and License Scheduler.
Oversee LSF operations in Qualcomm's regional computing environments, while serving as an escalation point for Engineering IT leads.
Perform LSF and License Scheduler configuration changes in large clusters using Git.
Collaborate with frontline support teams by troubleshooting and resolving Tier 3 LSF and general DRM issues related to service availability, performance, and SLA compliance.
Interface with other Engineering Compute teams to drive follow-through on issues impacting GAT-managed services around the world.
Decommission EOL servers and clusters to reduce licensing cost and promote effective utilization of compute resources.

LSF · NVIDIA V100 · H100 · AI/ML · deep learning · License Scheduler · +3 more

IT Engineer, Lead

Dec 2017 – Nov 2021 · 3 yrs 11 mos

Troubleshoot computing environment issues that contribute to impaired resource management.
Create clear search criteria, report the compiled data, and develop alerts within Splunk.
Assist in defining and documenting global grid environment standards.
Support the LSF Windows cluster environment.
Perform LSF cluster installation and patch deployment across global grids.
Provide critical engineering support during chip phase-out or BTO/MTO.

Splunk · LSF · Windows cluster environment

Xilinx

Lead Systems Administrator

Mar 2017 – Nov 2017 · 8 mos · On-site

Tier 3 Linux support administrator
Managing large (1000+ node) compute clusters using LSF or similar job schedulers.
Administering network services such as DNS, NIS, NFS, LDAP, Sendmail, FTP, rsync, and SSH.
Good knowledge of TCP/IP networking fundamentals.
Good experience with Linux imaging and configuration management using Puppet, PXE, and Kickstart.
Managing the Global LSF farm.
Experienced with capacity planning, utilization review, and performance monitoring.
Data Center point of contact for all high severity issues.

LSF · DNS · NIS · NFS · LDAP · TCP/IP · +1 more

Amazon Web Services (AWS)

Cloud Data Center Lead & HW Engineer

Nov 2016 – Feb 2017 · 3 mos · On-site

Help build the world’s largest Cloud infrastructure
Escalation point and technical troubleshooter for all Systems and Network hardware problems
Remediation of physical layer outages, both Systems & Network
Remediation or recovery of physical power issues on racks
Experience with data center layout, power, cooling, and rack space management
Install & configure racks of hosts in line with internal SLAs
Triage & resolve trouble tickets for all devices in the region
Data Center point of contact for all High Severity issues
Physical replacement of server and network device parts
Ensure correct rotation of parts & spares
Engage with Remote Hands & Eyes in EU Regional CloudFront POPs
Knowledge of AWS products: EC2, EBS, S3 etc.

AWS · Data Center · Systems and Network hardware

Xilinx

2 roles

Systems Administrator 2

Promoted

Jan 2014 – Oct 2016 · 2 yrs 9 mos · On-site

Platform LSF Administrator: Accountable for availability, latency, performance, efficiency, change management, monitoring, and emergency response of our managed LSF farm and services.
IT Hardware procurement and cost analysis as per engineering forecast and planning.
Experience working with UCS/Dell/HP Blade servers.
Good understanding of Data Center layout, power, cooling, and rack space management.
Understanding of networking/distributed computing environment concepts, including NFS and DNS.

LSF · UCS · Dell · HP Blade servers

Systems Administrator 1

Sep 2011 – Dec 2013 · 2 yrs 3 mos · On-site

Provide advanced Linux (RHEL, CentOS, Ubuntu, SUSE) troubleshooting and technical support to the Engineering organization.
Experience with and understanding of virtualized environments (VMware, Citrix Xen desktop environment, vSphere, vCenter, ESX Server).
System installation and configuration, high performance computing, security, installing third-party software in very large environments.
Participated in support case management related to Engineering infrastructure, in 24x7 on-call support rotation schedule.

Linux · VMware · Citrix Xen · vSphere