Subrahmanya Joshi

Software Engineer

Bengaluru, Karnataka, India · 6 yrs 6 mos experience
Most Likely To Switch · AI Enabled

Key Highlights

  • 6.5+ years in AI and HPC solutions.
  • Led multiple Generative AI projects at HPE.
  • Google Cloud Certified Professional in ML and Data Engineering.
Stackforce AI infers this person is a SaaS AI Engineer specializing in high-performance computing and machine learning.

Skills

Core Skills

Artificial Intelligence (AI), Machine Learning, Anomaly Detection, Artificial Intelligence for IT Operations (AIOps), Data Science

Other Skills

Time Series Analysis, Linux, PyTorch, Deep Learning, Retrieval-Augmented Generation (RAG), Multi-agent Systems, OpenVINO, Ray Tune, Predictive Maintenance, Research Skills, Kubernetes, Large Language Models (LLM), Python, Fine Tuning, Nvidia TensorRT

About

I am a Software Engineer at Hewlett Packard Enterprise (HPE) with 6.5+ years of experience in developing and deploying end-to-end AI solutions. My work primarily focuses on using AI to improve the reliability and efficiency of high-performance computing (HPC) datacenters by building systems that monitor, detect, and predict anomalies and failures (AIOps). I have designed AI-powered tools that predict job failures, analyze system logs for anomalies, and correlate alerts, enabling optimized operations at scale. In addition, I lead multiple Generative AI projects within HPE-HPC, applying GenAI to automate HPC diagnostics, enhance system intelligence, and improve operational efficiency in datacenters. I am also a core contributor on the team developing an agentic AI-powered copilot for the next generation of HPC management software. I am a Google Cloud Certified Professional Machine Learning and Data Engineer, with hands-on experience in designing and deploying production-grade machine learning systems on Google Cloud Platform (GCP).

Experience

Hewlett Packard Enterprise

4 roles

Software Engineer III

Promoted

Feb 2024 - Present · 2 yrs 1 mo

  • Authored and presented a paper on AI-driven datacenter failure prediction at CUG2024, engaging with global customers and industry experts in Perth, Australia.
  • Authored a paper on the sustainability impact of AIOps in datacenters, shortlisted for presentation and publication at the Sustainable Supercomputing Workshop, part of the SC24 conference.
  • Authored and presented multiple papers at HPE internal technical conferences.
  • Led a team to build a RAG-based chat interface for product documentation, improving information accessibility.
  • Designed and implemented a machine learning model to predict HPC job failures at the submission phase, reducing wasted compute cycles.
  • Achieved a 3× improvement in model inference throughput using hardware-accelerated inference (TensorRT and OpenVINO).
  • Led development of AI-powered HPC diagnostics solutions enhancing customer reliability and support experiences.
  • Built a highly scalable real-time log anomaly detection system for monitoring datacenter system logs using AI/ML.
  • Developing an intelligent alert correlation engine that clusters related datacenter alerts, improving incident response.
  • Leading design and development of an advanced multi-agent log analytics platform for autonomous system monitoring.
  • Enhancing and maintaining AIOps pipelines deployed in customer datacenters to ensure reliability and continuous improvement.
  • Developing an agentic AI-powered copilot system for next-generation HPC system management software.
Artificial Intelligence (AI), Time Series Analysis, Linux, PyTorch, Deep Learning, Retrieval-Augmented Generation (RAG), +15

Software Engineer II

Promoted

Feb 2022 - Feb 2024 · 2 yrs

  • Partnered with customers to enhance and optimize machine learning pipelines deployed in production environments.
  • Designed and deployed an AI-powered predictive tool to forecast datacenter failures, enabling proactive mitigation.
  • Built a POC anomaly detection system leveraging system log analytics to identify abnormal behaviors.
  • Contributed to transforming AIOps into a cloud-native service, improving scalability and accessibility.
  • Developed a machine learning-driven forecasting tool for cluster diagnostic results, improving operational efficiency.
  • Authored and presented multiple papers at HPE internal technical conferences.
Grafana, Keras, Time Series Analysis, Confluent Kafka, Data Science, Linux, +17

Software Engineer I

Aug 2019 - Jan 2022 · 2 yrs 5 mos

  • Developed an end-to-end AI system to detect anomalies across thousands of datacenter sensors with high efficiency.
  • Validated the solution on customer-provided historical data, achieving:
      o Early detection of anomalies up to 5 minutes before customer-reported incidents.
      o Identification of 50% more anomalies than existing systems.
      o A demonstrated potential to prevent 40% of high-priority incidents that could escalate into critical events.
  • Integrated AIOps with HPE and Cray’s cluster management platforms, enabling proactive monitoring for some of the world’s fastest supercomputers.
Grafana, Keras, Time Series Analysis, Confluent Kafka, Data Science, Linux, +13

Research Intern

Jan 2019 - Jul 2019 · 6 mos

  • Developed statistical and deep learning models for anomaly detection on large-scale datacenter sensor telemetry datasets (tens of millions of records) provided by our customers.
Time Series Analysis, Data Science, Linux, Software Development, Python, TensorFlow, +2

Education

Sri Jayachamarajendra College Of Engineering

BE - Bachelor of Engineering — Computer Science

Jan 2015 - Jan 2019
