Jaspreet S.

DevOps Engineer

North York, Ontario, Canada · 10 yrs 1 mo experience
AI ML Practitioner · AI Enabled

Key Highlights

  • Expert in architecting large-scale HPC infrastructures.
  • Proficient in optimizing AI workloads with GPU technologies.
  • Strong background in CI/CD integration for infrastructure.
Stackforce AI infers this person is a High Performance Computing and Infrastructure specialist with a focus on AI and cloud technologies.

Skills

Core Skills

High Performance Computing (HPC), Infrastructure

Other Skills

GPU, HPC, AI, Ansible, LSF, Slurm, GitLab Runner, Jenkins, Prometheus, Weka, Pure Storage, NVIDIA V100, H100, AI/ML, deep learning

About

Architect, install, and manage large-scale enterprise environments and optimize HPC workloads.

Experience

Total Experience: 10 yrs 1 mo
Average Tenure: 5 yrs
Current Experience: --

Tenstorrent

Sr Engineer, Developer Infrastructure

Apr 2023 - Present · 3 yrs 2 mos · Toronto, Ontario, Canada · Hybrid

  • Architect and operate large-scale GPU and HPC infrastructure supporting AI development, validation, and CI/CD workflows in datacenter-style environments.
  • Administer GPU/HPC clusters using LSF and Slurm, including system provisioning, scheduler tuning, capacity expansion, and performance optimization.
  • Lead automated bare-metal and cluster deployments using Ansible, enabling rapid and consistent system bring-up at scale.
  • Integrate infrastructure with CI/CD pipelines (GitLab Runner, Jenkins) to support automated testing and validation workflows.
  • Deploy and maintain monitoring and observability platforms using Prometheus and Node Exporter for real-time system visibility.
  • Design and operate high-throughput, low-latency storage platforms (Weka, Pure Storage) optimized for GPU-intensive validation and AI workloads.
  • Collaborate cross-functionally with engineering, QA, DevOps, and operations teams to deliver unified infrastructure platforms.
GPU, HPC, AI, Ansible, LSF, Slurm (+7 more)

Qualcomm

2 roles

IT Engineer, Staff

Promoted

Dec 2021 - Feb 2023 · 1 yr 2 mos

  • Accountable for availability, latency, performance, efficiency, change management, monitoring, and emergency response for over 6 global grids with more than 1.6 million cores.
  • Managed and supported a large-scale LSF cluster equipped with NVIDIA V100 and H100 GPUs, optimizing performance for AI/ML, deep learning, and HPC workloads.
  • Administered global computing resource management systems and services such as LSF and License Scheduler.
  • Oversaw LSF operations in Qualcomm's regional computing environments while serving as an escalation point for Engineering IT leads.
  • Performed LSF and License Scheduler configuration changes in large clusters using Git.
  • Collaborated with frontline support teams to troubleshoot and resolve Tier 3 LSF and general DRM issues related to service availability, performance, and SLA compliance.
  • Worked with other Engineering Compute teams to drive follow-through on issues impacting GAT-managed services around the world.
  • Decommissioned EOL servers and clusters to reduce licensing costs and promote effective utilization of compute resources.
LSF, NVIDIA V100, H100, AI/ML, deep learning, License Scheduler (+3 more)

IT Engineer, Lead

Dec 2017 - Nov 2021 · 3 yrs 11 mos

  • Troubleshot computing environment issues that impaired resource management.
  • Created clear search criteria, reported the compiled data, and developed alerts in Splunk.
  • Assisted in defining and documenting global grid environment standards.
  • Supported the LSF Windows cluster environment.
  • Performed LSF cluster installation and patch deployment across global grids.
  • Provided critical engineering support during chip phase out or BTO/MTO.
Splunk, LSF, Windows cluster environment

Xilinx

Lead Systems Administrator

Mar 2017 - Nov 2017 · 8 mos · On-site

  • Tier 3 Linux support administrator.
  • Managed large (1000+ node) compute clusters using LSF or similar job schedulers.
  • Administered network services such as DNS, NIS, NFS, LDAP, Sendmail, FTP, rsync, and SSH.
  • Good knowledge of TCP/IP networking fundamentals.
  • Good experience with Linux imaging and configuration management using Puppet, PXE, and Kickstart.
  • Managed the global LSF farm.
  • Experienced with capacity planning, utilization review, and performance monitoring.
  • Data center point of contact for all high-severity issues.
LSF, DNS, NIS, NFS, LDAP, TCP/IP (+1 more)

Amazon Web Services (AWS)

Cloud Data Center Lead & HW Engineer

Nov 2016 - Feb 2017 · 3 mos · On-site

  • Help build the world’s largest Cloud infrastructure
  • Escalation point and technical troubleshooter for all Systems and Network hardware problems
  • Remediation of physical layer outages, both Systems & Network
  • Remediation or recovery of physical power issues on racks
  • Experience with data center layout, power, cooling, and rack space management
  • Install & configure racks of hosts in line with internal SLAs
  • Triage & resolve trouble tickets for all devices in the region
  • Data Center point of contact for all High Severity issues
  • Physical replacement of server and network device parts
  • Ensure correct rotation of parts & spares
  • Engage with Remote Hands & Eyes in EU Regional CloudFront POPs
  • Knowledge of AWS products: EC2, EBS, S3, etc.
AWS, Data Center, Systems and Network hardware

Xilinx

2 roles

Systems Administrator 2

Promoted

Jan 2014 - Oct 2016 · 2 yrs 9 mos · On-site

  • Platform LSF Administrator: Accountable for availability, latency, performance, efficiency, change management, monitoring, and emergency response of our managed LSF farm and services.
  • IT hardware procurement and cost analysis based on engineering forecasts and planning.
  • Experience working with UCS, Dell, and HP blade servers.
  • Good understanding of data center layout, power, cooling, and rack space management.
  • Understanding of networking and distributed computing environment concepts, including NFS and DNS.
LSF, UCS, Dell, HP Blade servers

Systems Administrator 1

Sep 2011 - Dec 2013 · 2 yrs 3 mos · On-site

  • Provide advanced Linux (RHEL, CentOS, Ubuntu, SUSE) troubleshooting and technical support to the Engineering organization.
  • Experience with and understanding of virtualized environments (VMware, Citrix XenDesktop, vSphere, vCenter, ESX Server).
  • System installation and configuration, high-performance computing, security, and third-party software installation in very large environments.
  • Participated in support case management for Engineering infrastructure as part of a 24x7 on-call support rotation.
Linux, VMware, Citrix Xen, vSphere

Education

Maharshi Dayanand University

Bachelor of Technology - BTech — Electrical and Electronics Engineering

Jan 2007 - Jan 2011
