VIRENDER KUMAR — Software Engineer

12+ years of experience as a High-Performance Computing (HPC) cluster system administrator.
- Working experience with Cray/Lenovo/Dell/IBM/NVIDIA/DDN hardware
- HPC system installation and commissioning
- Expertise in installing HPC cluster middleware, HPC development tools, job schedulers, resource managers, and related software
- Expertise in Linux OS, InfiniBand, and networking
- Server hardware installation, support, and troubleshooting
- Monitoring and troubleshooting HPC systems onsite and remotely
- Expertise in high-end servers, storage, backup solutions, InfiniBand (NDR, HDR, EDR, FDR, QDR) switches, fabric switches, Unix, Linux, and HPC cluster integration
- Experience in GPFS, Lustre, IBM xCAT, LSF, LoadLeveler, PBS Pro, Slurm, AIX, Linux, UNIX, Scientific Linux, SLES, HPC applications, and benchmarking
- Experience with security vulnerability, data management, and monitoring tools
- Skilled in integrated solutions and in the installation and configuration of hardware and software
- Providing technical guidance and support to a major High-Performance Computing environment
- Advanced systems support for a large-scale supercomputing center, including installation, integration, and management of high-performance computer systems
- Able to work well independently and as part of a team, provide guidance to the team, and work with other business units
- 8+ years of work experience with Cray (an HPE company), IBM India Pvt. Ltd, and HCL Infosystems Ltd
ITIL4 Foundation certified
Red Hat Certified System Administrator, Certification ID 110-364-361
Red Hat Certified Engineer, Linux 5, Certification ID 805011550158863
AIX 7 certified, Candidate/Testing ID: IBM000007921
Cloud U rack Certification (Cloud Computing)
Azure Administration Essential Training
Oracle OCI Foundations 2021 Associate Certification
InfiniBand professional training
Tivoli Storage Manager 7.1
Servicing the Mellanox EDR 100 Gb InfiniBand 216, 324, and 648 port switches, MT-M 8828-ED0, 8828-ED1, and 8828-ED2
Servicing the IBM POWER8 8335-GCA and 8335-GTA (IBM Power System S822LC)
Servicing the Mellanox EDR 100 Gb InfiniBand 216 and 648 port switches, MT-M 8828-ED0 and 8828-ED2
Servicing the Mellanox EDR 100 Gb InfiniBand TOR switches, MT-M 8828-E36 and 8828-E37
E-Service Training (EST)
Qual-SAN Technology (Cisco)
IBM 3584 Enhancements - HA1 and 3588-F3A - Service Training
Tape & Tape Library Training
Training on Linux Cluster
V7K storage training
TAS VSM Training in Basel, Switzerland
Supercomputing conference
Red Hat Ansible workshop

Stackforce AI infers this person is a High-Performance Computing Infrastructure Specialist with extensive experience in system administration and management.

Location: San Francisco, California, United States

Experience: 14 yrs 2 mos

Skills

HPC System Design
Network Administration
HPC System Management
Data Center Management
HPC Support
System Administration
HPC System Administration
Data Management

Career Highlights

12+ years in High-Performance Computing administration
Expertise in HPC system design and management
Proven track record in vendor coordination and support

Work Experience

Biohub

Principal Engineer, HPC (1 yr 3 mos)

University of Chicago, Research Computing Center

Principal HPC Engineer (4 yrs 4 mos)

Hewlett Packard Enterprise

Technical Support Manager (9 mos)

Cray Inc.

HPC Team Lead (2 yrs 3 mos)

IBM

HPC Support Engineer (4 yrs 6 mos)

HCL Infosystems Ltd

HPC system admin (1 yr 1 mo)

Education

Bachelor of Engineering (B-Tech) at Maharishi Markandeshwar (Deemed to be University) Official

Other Skills

Active Directory
Automation scripts
Backup and Archive solutions
C
Cluster
DNS
Data backup and migration
Data center infrastructure management
Docker Products
Firmware updates
GPFS/Spectrum Scale
Git
HACMP
HPC cluster implementation
HPC security tools

Experience

Total Experience: 14 yrs 2 mos
Average Tenure: 2 yrs 7 mos
Current Experience: 1 yr 3 mos

Biohub

Principal Engineer, HPC

Mar 2025 – Present · 1 yr 3 mos · San Francisco Bay Area · On-site

University of Chicago, Research Computing Center

Principal HPC Engineer

Nov 2020 – Mar 2025 · 4 yrs 4 mos · Chicago, Illinois, United States

Design, development, and operation of new HPC systems
Installation, deployment, maintenance, and troubleshooting of new and existing HPC hardware (CPU/GPU) systems and software
Installation, deployment, maintenance, and troubleshooting of GPFS/Spectrum Scale and ZFS
Installation, deployment, maintenance, and troubleshooting of the backup and archive solution
Installation, deployment, maintenance, and troubleshooting of the network (Ethernet and InfiniBand)
Installation, deployment, maintenance, and troubleshooting of virtualization, DevOps, and hyperscale platforms
Installation, deployment, maintenance, and troubleshooting of HPC security tools such as CrowdStrike, Rapid7, Carbon Black, and Wazuh
Installation, deployment, maintenance, and troubleshooting of Globus, Samba, CIFS, NFS, and Ganesha
Installation, deployment, maintenance, and troubleshooting of xCAT/Confluent, Slurm, Checkmk, ThinLinc, Duo two-factor authentication (2FA), various license servers, Git, system firewalld, GPFS snapshot policies, etc.
Upgrading HPC system firmware (switches, hardware) and applying OS and security patches
Writing scripts to automate daily routine tasks and health checks using Ansible
Coordinating and following up with vendors on hardware issues, deployment of new HPC hardware and software, and resolution of infrastructure-related issues
Verifying vendor-submitted benchmark results on different architectures such as Intel and AMD (GROMACS, LAMMPS, QE MgO, QE Ice, HipMCL, HPCG, HPCC, IMB)
Providing day-to-day technical support to users
Writing, updating, and maintaining documentation of system-related issues and operational procedures on GitHub
Managing HPC servers from different vendors (IBM/Dell/Lenovo/NVIDIA/DDN/SM, GPUs)

HPC system design, GPFS/Spectrum Scale, Backup and Archive solutions, Network troubleshooting, Virtualization/DevOps, HPC security tools

Hewlett Packard Enterprise

Technical Support Manager

Jan 2020 – Oct 2020 · 9 mos · New Delhi, Delhi, India · On-site

Responsibilities:
Manage team responsible for providing on-site hardware, systems, sub-systems, and/or other applications support for customers according to contractual service levels.
Plan, direct, and monitor operational/tactical activities of the team.
Meet business and operation targets.
Recruit and support the development of direct staff members.
Handle customer escalations.
Establish relationships with customers and other functional managers.
Provide guidance on process improvements and recommend changes in alignment with
Provide coaching and leadership to assigned field technicians.

Cray Inc.

HPC Team Lead

Oct 2017 – Jan 2020 · 2 yrs 3 mos · Noida Area, India

Cray XC40 HPC System:
Cray XC40 2.8 PF HPC: Cray XC40 (Intel Xeon Broadwell E5-2695 v4, 18C, 2.1 GHz, with 128 GB), connected through the Cray Aries interconnect in a Dragonfly topology, with a total peak performance of 2.8 PF, 290 TiB of memory, 6.3 PB of home storage, 670 TB of scratch storage, and a Spectra Logic TFinity tape library (19 PB capacity with LTO-7)
2320 compute nodes in total (83520 cores)
Responsibilities:
Managing and monitoring on-site installation, service, and repair of Cray XC40 cluster components
Providing 24x7 maintenance support for the 2.8 PF HPC machine
HPC site team lead for the 2.8 PF machine
Maintaining the site SLA and providing 24x7 support
Updating system firmware to the latest version (OS, storage, switches, library, etc.)
Member of the team that commissioned and installed India's biggest HPC project
Managing the datacenter infrastructure
Led the localization during UAT and NOA
Managing user accounts and quota policy, production jobs, and PBS queues as per client requirements
Customizing node images through Bright Cluster Manager (BCM)
Backup and migration of production data through TAS VSM using the Spectra TFinity library
Day-to-day maintenance and client-requested activities
Installation and management of tools such as MATLAB, CDT, IDL, SVN, FlexNet, etc.
Supporting weather forecasting models such as UM, WRF, NCUM, and GFS
System configuration and management through automation tools such as Ansible and Puppet
Documentation of every system administration procedure
Coordinating with scientists to change and implement new models on the system
Following up with infrastructure vendors

HPC system management, Firmware updates, User account management, Data center infrastructure management, Data Center Management

IBM

HPC Support Engineer

Mar 2013 – Sep 2017 · 4 yrs 6 mos · Noida Area, India

IBM iDataPlex 350 TF HPC Cluster:
Bhaskara: IBM iDataPlex DX360M4 (Xeon E5-2670 8C 2.6 GHz, InfiniBand FDR14) high-performance computing system, 350 TF with 67 TB of aggregate distributed shared memory and 3 PB of storage (India’s most powerful supercomputer in 2015). 1052 compute nodes in total (16832 cores).
Responsibilities:
Implementation and maintenance of the 350 TF machine.
Installation and management of a General Parallel File System (GPFS) cluster.
Installation and configuration of the IBM LSF job scheduler.
Installation and configuration of Platform Application Center for monitoring and GUI job submission.
Installation and upgrade of scientific applications and libraries on the RHEL cluster.
Porting regional and atmospheric models (UM, NGEFS, WRF) to the RHEL cluster.
Creating, modifying, and distributing user IDs across compute nodes via xCAT.
Installing, maintaining, and supporting hardware and software components of the HPC environment.
Providing immediate support in implementation, troubleshooting, and maintenance of the HPC system.
Installation and configuration of Intel Cluster Studio, IBM POE, and required modules.
Managing and troubleshooting TSM and HSM servers and clients.
Installing and troubleshooting Mellanox InfiniBand FDR14 and Cisco MDS9148 fabric switches.
Troubleshooting the TS3500 tape library and TS1060 tape drives.
Managing and troubleshooting xCAT, LSF, IDL, and MATLAB server applications.
Experience in installing and checking system performance with the LINPACK, BLAS, HPL, and Intel MKL benchmarking applications.

HPC cluster implementation, Job Scheduler installation, Scientific application support, HPC Support, System Administration

HCL Infosystems Ltd

HPC system admin

Jan 2012 – Feb 2013 · 1 yr 1 mo · Noida Area, India

IBM P6-575 22 TF HPC System:
Environment: IBM P6-575 22 TF HPC (each node with 32 cores of IBM P6 CPU at 4.7 GHz and 32 GB RAM), 42 nodes in total connected through the InfiniBand network. Peak speed of ~22 TF with ~500 TB of GPFS storage and an IBM 3584 tape library of 100 TB capacity (LTO4)
Responsibilities:
System hardware monitoring, support, and repair of IBM P6 (P575) based HPC components and other IBM P-series servers (p560, p570) using the Hardware Management Console (HMC).
Log collection, issue identification, disk management, and disk replacements on IBM DS4k storage.
Tracing IB cable errors, replacing cables, and troubleshooting InfiniBand QLogic switches.
Production job monitoring, scheduling, and configuration using the IBM workload scheduler LoadLeveler.
Identifying and troubleshooting AIX v5 and v6 operating system issues.
Supporting weather forecasting models such as UM, WRF, NCUM, and GFS.
Data backup and migration using IBM TSM (Tivoli Storage Manager) on LTO4 tape drives.
Troubleshooting network and boot problems, monitoring, and performance optimization.
User creation and modification with file system quotas.
Generating daily, weekly, and monthly reports of system utilization using the NMON tool.