Anshul Jain

CEO

Bengaluru, Karnataka, India · 12 yrs 8 mos experience

Highly Stable

Key Highlights

Expert in High Performance Computing and Site Reliability Engineering.
Proficient in managing complex HPC infrastructures globally.
Skilled in automation and monitoring solutions for HPC systems.

Stackforce AI infers this person is a High Performance Computing and Infrastructure specialist with strong SRE capabilities.

Skills

Core Skills

Site Reliability Engineering, High Performance Computing (HPC), DevOps, System Engineering

Other Skills

AWS, Altair PBS Professional, Amazon Web Services (AWS), Ansible, Apache Kafka, Arista, Automation, C, C++, CCNA, CI/CD, Cassandra, CentOS, Cisco Technologies, Computer Hardware

About

HPC Engineer with blended skills in Network and System Engineering and Site Reliability Engineering, along with System Analyst roles. Extensive practical hands-on experience in deploying High Performance Clusters. Working on the technologies below:
- Provisioning AWS cloud-based HPC clusters via AWS ParallelCluster and setting up the ecosystem for them.
- Low-latency InfiniBand fabric HPC networks (Nvidia Mellanox IB & Intel OPA).
- HPC benchmarking (HPL, HPCG, Stream, Netperf, HPL-AI).
- xCAT, Slurm, IBM Spectrum GPFS, IBM Spectrum LSF, Lustre Storage.
- BeeGFS & Weka software-defined storage.
- ELK Stack (Elastic), TICK Stack and Prometheus for infrastructure monitoring.
- Overall SRE efforts for HPC infrastructure.
- Cumulus Linux OS & Ansible for network automation.
- High-speed Ethernet networking solutions (Cisco, Arista & Cumulus).
- Network switch configuration and topology design.
- Expert in Grafana and Kibana visualization.
- Mellanox UFM for detailed IB monitoring.
- Docker container setup for various stacks.
- AWS Learner.
- Python and Shell Script for task automation and data gathering.
- Power BI for creating HPC usage and benchmarking reports.
- DevOps toolchain: Puppet, Ansible, Git, Jenkins, JIRA, etc.
- Nagios/Namon configuration and management for alerting.
- NoSQL, InfluxDB, Elasticsearch, Vertica, Cassandra, MySQL, MongoDB, MariaDB databases.

Experience

12 yrs 8 mos

Total Experience

2 yrs 8 mos

Average Tenure

2 yrs

Current Experience

Analog Devices

HPC Tech Lead

Jun 2024 – Present · 2 yrs · Bengaluru, Karnataka, India · Hybrid

Shell

2 roles

Senior HPC Engineer

Promoted

Jul 2022 – Jun 2024 · 1 yr 11 mos

The role blends Network Engineering and Site Reliability Engineering on top of HPC, along with applications of Analytics and Business Intelligence. Handling multiple HPC clusters of differing compute capacity located globally. Responsibilities include:
Creation, Design and Operations of site reliability engineering (SRE) efforts on High Performance Computing systems using a variety of configuration management, IT monitoring, and automation tools.
Management, configuration, and design of InfiniBand Fabric (Mellanox and Intel OPA) for HPC Clusters.
Design and deploy automated networking solutions using Cumulus Linux OS & Ansible.
ELK Stack (Elasticsearch, Logstash, Kibana) and Prometheus Stack Design, Implementation, configuration, administration.
Designed a monitoring solution for CI/CD pipelines, containers, Harbor, and Jenkins to increase their resiliency.
Designed Monitoring and reporting solution for Slurm.
Slurm Scheduler Configurations and administration.
Management and configuration of Mellanox UFM (Unified Fabric Manager) for detailed monitoring and management of InfiniBand Fabric.
TICK Stack (Telegraf, InfluxDB, Chronograf, Kapacitor) design, configuration, implementation and administration.
Creation and design of Rack Layout and Network topology for HPC Clusters.
Learning AWS Cloud. Can create, configure and manage AWS services such as Load Balancers, EC2, Lambda, S3, EFS, EBS, RDS, IAM, Security Groups, Subnets, VPCs, etc.
Developed an effective, consistent SRE automation protocol based on HPC roles such as Network, Storage, Middleware, Operations, etc.
Managing operations of HPC Grid Environment.
Leveraged ElastiFlow to provide better visibility into HPC high-speed network performance.
Leveraged ElastAlert to provide alerting capability for the ELK Stack cluster.
Wrote Nagios plugins in Bash and Python. Designed and configured the Nagios Core setup.
Commissioning and Patching of HPC Computes and other Servers.

Network Engineering, Site Reliability Engineering, HPC Clusters Management, Infiniband Fabric Management, Cumulus Linux OS, Ansible, +5

HPC Engineer

Oct 2018 – Jul 2022 · 3 yrs 9 mos

Nuance Communications

HPC Operations Engineer

Jul 2016 – Sep 2018 · 2 yrs 2 mos · Pune Area, India

Worked at Nuance as an HPC Operations Engineer, with responsibilities including:
Managing operations of HPC Grid Environment.
Univa (Sun) Grid Engine Administration & configuration.
GPFS Storage Administration & configuration.
xCAT Administration.
Configuration of DevOps tools like Puppet, Git, Jenkins, JIRA, etc.
Optimization using Shell Script, Python.
Swift OpenStack Administration.
Configuring Analytical tools like Kibana, Grafana, Graphite.
Nagios & other monitoring tool Core Level Configuration & implementation.
Handling databases like Vertica, Cassandra, Whisper, InfluxDB, MySQL etc.
Created various analytics dashboards in Grafana, Kibana, and MicroStrategy.
Configure HP/Cisco Network, Linux System, Hypervisors, RedHat, CentOS, KVM.
Worked on Apache Spark Configurations.
NVIDIA GPU Administration & configuration.
JIRA Administration & configuration.

HPC Grid Management, UNIVA Grid Engine, GPFS Storage, xCat Administration, DevOps Tools Configuration, Monitoring Tools Configuration, +3

FIS

System Engineer

Sep 2014 – Jul 2016 · 1 yr 10 mos · Pune Area, India

The primary role was to support financial applications, networks, systems, and the environment while following the ITIL process. Production support on Linux environments and change management were the main aspects of the project. Worked in a 24x5 environment.
Good knowledge of financial applications, i.e. the Feed Handler used for delivering market data from multiple exchanges. Also worked on deployment projects.

Financial Application Support, Network Support, System Support, ITIL Process, Linux Environment, System Engineering