VIRENDER KUMAR

Software Engineer

San Francisco, California, United States · 14 yrs 2 mos experience
Highly Stable

Key Highlights

  • 12+ years in High-Performance Computing administration
  • Expertise in HPC system design and management
  • Proven track record in vendor coordination and support
Stackforce AI infers this person is a High-Performance Computing Infrastructure Specialist with extensive experience in system administration and management.

Skills

Core Skills

HPC System Design, Network Administration, HPC System Management, Data Center Management, HPC Support, System Administration, HPC System Administration, Data Management

Other Skills

Active Directory, Automation scripts, Backup and Archive solutions, C, Cluster, DNS, Data backup and migration, Data center infrastructure management, Docker Products, Firmware updates, GPFS/Spectrum Scale, Git, HACMP, HPC cluster implementation, HPC security tools

About

12+ years of experience as a High-Performance Computing cluster system administrator.

  • Working experience with Cray/Lenovo/Dell/IBM/NVIDIA/DDN hardware
  • HPC system installation and commissioning
  • Expertise in installing HPC cluster middleware, HPC development tools, job schedulers, resource managers, and related software
  • Expertise in Linux OS, InfiniBand, and networking
  • Server hardware installation, support, and troubleshooting
  • Monitoring and troubleshooting HPC systems onsite and remotely
  • Expertise in high-end servers, storage, backup solutions, InfiniBand NDR, HDR, EDR, FDR, and QDR switches, fabric switches, Unix, Linux, and the High-Performance Computing cluster domain and integration
  • Experience in GPFS, Lustre, IBM xCAT, LSF, LoadLeveler, PBS Pro, Slurm, AIX, Linux, UNIX, Scientific Linux, SLES, HPC applications, and benchmarking
  • Experience in security vulnerability, data management, and monitoring tools
  • Skilled in integrated solutions and in the installation and configuration of hardware and software
  • Providing technical guidance and support to a major High-Performance Computing environment
  • Advanced systems support for a large-scale supercomputing center, including installation, integration, and management of high-performance computer systems
  • Ability to work well independently and as part of a team, provide guidance to the team, and work with other business units
  • 8+ years of work experience with Cray Supercomputer (HPE company), IBM India Pvt. Ltd, and HCL Infosystems Ltd

Certifications and training:

  • ITIL4 Foundation certified
  • Red Hat Certified System Administrator, Certification ID 110-364-361
  • Red Hat Certified Engineer, Linux 5, Certification ID 805011550158863
  • AIX 7 certified, Candidate/Testing ID: IBM000007921
  • Cloud U rack Certification (Cloud Computing)
  • Azure Administration Essential Training
  • Oracle OCI Foundations 2021 Associate Certification
  • InfiniBand professional training
  • Tivoli Storage Manager 7.1
  • Servicing the Mellanox EDR 100 Gb InfiniBand 216, 324, and 648 port switches, MT-M 8828-ED0, 8828-ED1, and 8828-ED2
  • Servicing the IBM POWER8 8335-GCA and 8335-GTA (IBM Power System S822LC)
  • Servicing the Mellanox EDR 100 Gb InfiniBand 216 and 648 port switches, MT-M 8828-ED0 and 8828-ED2
  • Servicing the Mellanox EDR 100 Gb InfiniBand TOR switches, MT-M 8828-E36 and 8828-E37
  • E-Service Training (EST)
  • Qual-SAN Technology (Cisco)
  • IBM 3584 Enhancements - HA1 and 3588-F3A - Service Training
  • Tape & Tape Library Training
  • Training on Linux Cluster
  • V7K storage training
  • TAS VSM training in Basel, Switzerland
  • Supercomputing conference
  • Red Hat Ansible workshop

Experience

Total Experience: 14 yrs 2 mos
Average Tenure: 2 yrs 7 mos
Current Experience: 1 yr 3 mos

Biohub

Principal Engineer, HPC

Mar 2025 - Present · 1 yr 3 mos · San Francisco Bay Area · On-site

University of Chicago, Research Computing Center

Principal HPC Engineer

Nov 2020 - Mar 2025 · 4 yrs 4 mos · Chicago, Illinois, United States

  • Design, development, and operation of new HPC systems
  • Installation, deployment, maintenance, and troubleshooting of new and existing HPC hardware (CPU/GPU) systems and software
  • Installation, deployment, maintenance, and troubleshooting of GPFS/Spectrum Scale and ZFS
  • Installation, deployment, maintenance, and troubleshooting of the backup and archive solution
  • Installation, deployment, maintenance, and troubleshooting of the network (Ethernet and InfiniBand)
  • Installation, deployment, maintenance, and troubleshooting of virtualization/DevOps/hyperscale
  • Installation, deployment, maintenance, and troubleshooting of HPC security tools such as CrowdStrike, Rapid7, Carbon Black, and Wazuh
  • Installation, deployment, maintenance, and troubleshooting of Globus, Samba, CIFS, NFS, and Ganesha
  • Installation, deployment, maintenance, and troubleshooting of xCAT/Confluent, Slurm, Checkmk, ThinLinc, Duo two-factor authentication (2FA), various license servers, Git, system firewalld, GPFS snapshot policies, etc.
  • Upgrading HPC system firmware (switches, hardware), OS patches, security patches, etc.
  • Creating scripts to automate daily routine tasks and health checks using Ansible
  • Coordinating and following up with vendors on hardware issues, deployment of new HPC hardware and software, and infrastructure-related issue resolution
  • Verifying vendor-submitted benchmark results on different architectures such as Intel and AMD (GROMACS, LAMMPS, QE MgO, QE Ice, HipMCL, HPCG, HPCC, IMB)
  • Providing day-to-day technical support to users
  • Writing, maintaining, and updating system issue records and operational documents on GitHub
  • Managing HPC servers from different vendors (IBM/Dell/Lenovo/NVIDIA/DDN/SM, GPUs)
HPC system design, GPFS/Spectrum Scale, Backup and Archive solutions, Network troubleshooting, Virtualization/DevOps, HPC security tools

Hewlett Packard Enterprise

Technical Support Manager

Jan 2020 - Oct 2020 · 9 mos · New Delhi, Delhi, India · On-site

  • Responsibilities:
  • Manage team responsible for providing on-site hardware, systems, sub-systems, and/or other applications support for customers according to contractual service levels.
  • Plan, direct, and monitor operational/tactical activities of the team.
  • Meet business and operation targets.
  • Recruit and support the development of direct staff members.
  • Handle customer escalations.
  • Establish relationships with customers and other functional managers.
  • Provide guidance on process improvements and recommend changes in alignment with
  • Provide coaching and leadership to assigned field technicians.

Cray Inc.

HPC Team Lead

Oct 2017 - Jan 2020 · 2 yrs 3 mos · Noida Area, India

  • Cray XC40 HPC system:
  • Cray XC40 2.8 PF HPC: Cray XC40 (Intel Xeon Broadwell E5-2695 v4, 18C, 2.1 GHz, with 128 GB), connected through the Cray Aries Dragonfly topology; total peak performance of 2.8 PF with 290 TiB memory, 6.3 PB home and 670 TB scratch storage, and a Spectra Logic TFinity library (19 PB capacity with LTO-7)
  • Total of 2320 compute nodes (83520 cores)
  • Responsibilities:
  • Managing and monitoring on-site installation, service, and repair of Cray XC40 cluster machine components
  • Providing 24x7 maintenance support for the 2.8 PF HPC machine
  • HPC Site Team lead of 2.8 P/F machine
  • Maintaining Site SLA & providing 24*7 support
  • Updating system firmware to the latest version (OS, storage, switches, library, etc.)
  • Team member during the commissioning and installation of India's biggest HPC project
  • Managing the Datacenter infrastructure
  • Lead the localization during UAT & NOA
  • Managing user account and quota policy, production jobs & PBS queue as per client requirement.
  • Customize the Node image through BCM (Bright Cluster Manager)
  • Backup and migration of production data through TAS VSM using the Spectra TFinity library
  • Involved in day-to-day maintenance & client requirement activity
  • Installation/management of tools like Matlab, CDT, IDL, SVN, FlexNet, etc.
  • Support weather forecasting models like UM, WRF, NCUM, and GFS
  • System configuration and management through Automation tools like Ansible and Puppet
  • Documentation of every system administrator procedure
  • Coordinate with scientists to change and implement the new models in system
  • Follow-up with infrastructure vendors
HPC system management, Firmware updates, User account management, Data center infrastructure management, Data Center Management

IBM

HPC Support Engineer

Mar 2013 - Sep 2017 · 4 yrs 6 mos · Noida Area, India

  • IBM iDataPlex 350 TF HPC cluster:
  • Bhaskara: IBM iDataPlex dx360 M4 (Xeon E5-2670 8C 2.6 GHz, InfiniBand FDR14) high-performance computing system, 350 TF, with 67 TB of aggregate distributed shared memory and 3 PB of storage (India's most powerful supercomputer in 2015). Total of 1052 compute nodes (16832 cores).
  • Responsibilities:
  • Implementation and maintenance of the 350 TF machine.
  • Installation and management of the General Parallel File System (GPFS) cluster.
  • Installation and configuration of the IBM LSF job scheduler.
  • Installation and configuration of Platform Application Center for monitoring and GUI job submission.
  • Installation and upgrade of scientific applications and libraries on the RHEL cluster.
  • Porting regional and atmospheric models (UM, NGEFS, WRF) to the RHEL cluster.
  • Creating, modifying, and distributing user IDs across compute nodes via xCAT.
  • Installing, maintaining, and supporting hardware and software components of the HPC environment.
  • Providing immediate support in implementation, troubleshooting, and maintenance of the HPC system.
  • Installation and configuration of Intel Cluster Studio, IBM POE, and required modules.
  • Managing and troubleshooting TSM and HSM servers and clients.
  • Installing and troubleshooting Mellanox InfiniBand FDR14 and Cisco MDS 9148 fabric switches.
  • Troubleshooting TS3500 and TS1060 tape drives and library.
  • Managing and troubleshooting xCAT, LSF, IDL, and MATLAB server applications.
  • Experience in installation and system performance checking with LINPACK, BLAS, HPL, and Intel MKL benchmarking applications.
HPC cluster implementation, Job Scheduler installation, Scientific application support, HPC Support, System Administration

HCL Infosystems Ltd

HPC system admin

Jan 2012 - Feb 2013 · 1 yr 1 mo · Noida Area, India

  • IBM P6-575 22 TF HPC system:
  • Environment: IBM P6-575 22 TF HPC (each node with 32 cores of IBM P6 CPU at 4.7 GHz and 32 GB RAM), a total of 42 nodes connected through the InfiniBand network. Peak speed ~22 TF with ~500 TB GPFS storage and an IBM 3584 tape library of 100 TB capacity (LTO-4)
  • Responsibilities:
  • System hardware monitoring, support, and repair of IBM P6 (P575) based HPC components and other IBM P-series servers (p560, p570) using the Hardware Management Console (HMC).
  • Log collection, fault identification, disk management, and replacements in IBM DS4000 storage.
  • Tracing, replacing, and troubleshooting IB cable errors on QLogic InfiniBand switches.
  • Production job monitoring, scheduling, and configuration using the IBM workload scheduler LoadLeveler.
  • Identifying and troubleshooting AIX v5 and v6 operating system issues.
  • Supporting weather forecasting models such as UM, WRF, NCUM, and GFS.
  • Data backup and migration using IBM TSM (Tivoli Storage Manager) on LTO-4 tape drives.
  • Troubleshooting network and booting problems, monitoring, and performance optimization.
  • User creation and modification with file system quotas.
  • Generating daily, weekly, and monthly system utilization reports using the NMON tool.
System hardware monitoring, Data backup and migration, Network troubleshooting, HPC System Administration, Data Management

Education

Maharishi Markandeshwar (Deemed to be University)

Bachelor of Engineering (B-Tech) — Computer Engineering

Jan 2007 - Jan 2010
