Kenny Sheridan

CEO

Seattle, WA, United States · 18 yrs experience
AI ML Practitioner · AI Enabled

Key Highlights

  • Expert in building high-performance AI infrastructure.
  • Proficient in Rust for scalable compute solutions.
  • Former Marine with strong leadership and training skills.
Stackforce AI infers this person is a high-performance computing and AI infrastructure expert with a strong background in Rust and cloud technologies.

Skills

Core Skills

Infrastructure Engineering · AI Infrastructure · Supercomputing Engineering · Performance Optimization · AI/ML Infrastructure Design · Vendor-agnostic Solutions · Performance Automation · Cloud Infrastructure · Hardware Testing · Infrastructure Management · System Administration · IT Support · Meteorology · Instruction

Other Skills

Rust (Programming Language) · Virtualization · Hardware Architecture · Artificial Intelligence (AI) · System Architecture · Supercomputing · High Performance Computing (HPC) · Terraform · Networking · Graphics Processing Unit · System Testing · MI300X · Research and Development (R&D) · AMD · ROCm

About

After eight years as a Meteorologist in the U.S. Marine Corps, I moved into building reliable infrastructure systems for high-performance compute environments. My work centers on designing and implementing software systems that manage GPU-backed workloads across their full lifecycle, not on manual operations or ad-hoc administration.

I design and build vendor-agnostic platforms spanning bare metal, virtual machines, and Kubernetes, with an emphasis on predictable performance and clear system behavior. This includes system onboarding, design, implementation, and validation for NVIDIA Hopper/Blackwell and AMD Instinct environments.

A large part of my work involves co-designing management planes and host-level agents for asset discovery, inventorying, topology awareness, and cluster profiling. These systems encode infrastructure knowledge directly into software, allowing behavior to be reasoned about, tested, and automated rather than manually managed.

I also build orchestration, deployment, and validation frameworks for GPU fleets, alongside distributed performance testing systems that measure networking, storage throughput and latency, and collective GPU operations (NCCL/RCCL/MPI). I then correlate infrastructure behavior directly to model training and inference performance, making regressions visible at the workload level.

In addition, I build cloud-native developer tooling, such as CLIs and libraries, that allows engineers to provision, inspect, validate, and extend infrastructure programmatically. The tools are designed to integrate with cloud-native workflows, emphasizing explicit state, repeatability, and testability.

Alongside internal systems, I build and maintain open-source Rust libraries (crates) that my company relies on for vendor-agnostic infrastructure automation. These crates serve as core building blocks across provisioning, validation, performance testing, and developer tooling.

I also spend a fair amount of time on-call, troubleshooting real production issues in code and debugging distributed behavior, performance regressions, and failure modes across compute, networking, databases, and storage. That feedback loop directly informs system design and hardening.

Most infrastructure software is written in Rust, paired with Nix/NixOS for reproducible builds and controlled rollout. Related work includes RDMA (RoCE), TCP/IP, gRPC, and hardware-aware system design.

The goal is to treat infrastructure and software as explicit, measurable, and maintainable systems that scale cleanly across bare metal, VMs, and Kubernetes.

Experience

Andromeda

Member of Technical Staff - Infrastructure Product

Mar 2026 - Present · 0 mo · Greater Seattle Area · Remote

  • Member of Technical Staff focused on building and scaling reproducible high-performance AI infrastructure that is reliably matched to model training and inference workloads.
  • Focus areas:
  • Bare-metal GPU orchestration at cluster scale
  • Virtualization and multi-tenant compute isolation
  • Hardware lifecycle
  • Distributed systems and control planes
  • Performance engineering and real-world benchmarking
  • System benchmarking and modeling over time
  • Workload matching: placing the right models on the right hardware
  • Developer-facing platforms that make large-scale infrastructure maintainable and profitable
Rust (Programming Language) · Virtualization · Hardware Architecture · Artificial Intelligence (AI) · System Architecture · Infrastructure Engineering · +1

San Francisco Compute Company

Supercomputing Engineer

Aug 2024 - Mar 2026 · 1 yr 7 mos · Greater Seattle Area

  • As a Supercomputing Engineer at San Francisco Compute, I ensure compute clusters operate at peak efficiency, leveraging Rust exclusively to deliver high performance and production-grade confidence in our code. My role includes extensive performance testing of GPU accelerators, distributed storage, and heterogeneous compute environments, ensuring scalability and robustness for workloads, including financial transactions.
  • Responsibilities:
  • Manage and Optimize ML Training Clusters: Drive performance and reliability for machine learning clusters, integrating GPUs and specialized hardware, with a focus on Rust-based optimization.
  • Performance Testing at Scale: Conduct rigorous testing across GPU-accelerated and mixed compute environments to meet stringent performance and scalability standards.
  • Rust-Based Infrastructure and Orchestration: Our VM orchestrator is built in Rust, providing the high efficiency and security needed for large-scale compute and financial transaction handling.
  • Infrastructure Monitoring and Troubleshooting: Continuously monitor hardware health in real-time, troubleshoot, and implement swift solutions to minimize downtime.
  • Automate Infrastructure Management: Develop infrastructure as code using Rust, enabling scalable and reliable hardware management.
  • Predictive Maintenance: Apply data-driven approaches to anticipate and prevent system failures proactively.
  • Customer Engagement: Collaborate with customers to understand their needs and ensure our systems exceed expectations, particularly in financial applications requiring high security and performance.
Supercomputing · Rust (Programming Language) · High Performance Computing (HPC) · Terraform · Networking · Supercomputing Engineering · +1

TensorWave

Senior AI & HPC Infrastructure Engineer

May 2024 - Jul 2024 · 2 mos · Seattle metropolitan area, Washington, United States · Remote

  • Designed and implemented multi-node GPU clusters using RDMA over Converged Ethernet (RoCE) on traditional TCP/IP networks, achieving all_reduce bandwidth exceeding that of Nvidia's DGX systems while using RoCE switches from a non-Nvidia vendor. This solution scales across traditional TCP/IP networks, lowering total cost of ownership (TCO) without sacrificing performance despite the absence of InfiniBand.
  • 🦀 Vendor-Agnostic Solutions: Developed a proof-of-concept solution that scales AI/ML infrastructure across hundreds to hundreds of thousands of nodes and is compatible with both Nvidia and AMD accelerators, ensuring flexibility, avoiding vendor lock-in, and demonstrating the ability to compose non-Nvidia multi-node GPU clusters.
  • Documentation and Reproducibility: Authored a step-by-step guide in Markdown and PDF formats for reliably reproducing the high-performance compute cluster setup and configurations.
  • Model Deployment Frameworks: Experienced in building and scaling model deployment frameworks like Burn, optimizing backend infrastructure for enterprise-scale AI/ML applications.
  • Event Co-Hosting: Co-hosted company-sponsored events such as Twitter Spaces to discuss advancements in HPC infrastructure in support of AI/ML and robotics efforts.
  • Hardware System Design: Proficient in system design, component selection, and iterative testing to ensure optimal performance.
  • Low-Latency Systems: Skilled in designing and implementing systems requiring ultra-low latency, crucial for AI, robotics, and HPC applications.
  • 🦀 Most Recent Production Project
  • Total CPUs: 8 (AMD EPYC 9754)
  • Total CPU Cores: 1024
  • Total CPU Threads: 2048
  • Total CPU L3 Cache: 3072 MB
  • Total CPU TDP: 2880 Watts
  • Total GPUs: 32 (AMD Instinct MI300X)
  • Total GPU Memory: 6144 GB
  • Total GPU Memory Bandwidth: 169.6 TB/s
  • Total GPU FP32 Performance: 41.6 PetaFLOPs
  • Networking: RoCE with 800G switching
Graphics Processing Unit · System Testing · MI300X · High Performance Computing (HPC) · Networking · Artificial Intelligence (AI) · +5

ServiceNow

3 roles

Senior Hardware & Software Performance Automation Engineer

Feb 2022 - Nov 2023 · 1 yr 9 mos

  • As a Senior Infrastructure and Hardware Engineer at ServiceNow, I am dedicated to enhancing backend cloud hardware through innovative software development. My role encompasses developing performance-improving solutions, pioneering testing methods, and deploying enterprise hardware strategically.
  • Key Responsibilities and Achievements:
  • Backend Optimization: Lead initiatives to enhance ServiceNow’s backend cloud hardware efficiency and performance, involving system optimization and new technology development.
  • Python to Go Transition: Spearheaded the shift from Python to Go (Golang), focusing on intellectual property protection and performance enhancement.
  • Advanced Testing Methods: Developed and implemented sophisticated testing methodologies to ensure hardware reliability and efficiency.
  • Strategic Hardware Deployment: Managed the strategic implementation of enterprise hardware, ensuring seamless integration and minimal disruption.
  • Collaboration and Leadership: Worked closely with cross-functional teams, guiding projects, mentoring engineers, and shaping infrastructure decisions.
  • Industry Research and Improvement: Continuously researched cloud hardware and software advancements to maintain ServiceNow’s technological edge.
  • Infrastructure Scalability: Played a key role in scaling the infrastructure from 8,000 to over 400,000 servers, demonstrating significant growth management skills.
  • My role at ServiceNow leverages my deep expertise in software engineering, particularly in backend systems and cloud infrastructure, to drive technological innovation and operational efficiency in hardware asset management.
System Architecture · Brand Development · Hardware Architecture · System Testing · Model Training · Software Development · +3

Senior Hardware Test Engineer

Promoted

Feb 2020 - Mar 2022 · 2 yrs 1 mo

  • In my capacity as a Senior Hardware Test Engineer, I've significantly contributed to leading initiatives in cloud infrastructure and storage technologies, enhancing collaboration and innovation.
  • Key Contributions and Roles:
  • System Test Deployment and Automation - Assisted in orchestrating and executing system tests for our cloud operation's bare-metal infrastructure, utilizing Go for automation.
  • Design and Development Collaboration - Actively involved in design discussions and requirements gathering, influencing cloud infrastructure solutions.
  • Engineering Team Partnerships - Worked with ODMs, system engineers, and design teams, aligning product roadmaps with business objectives.
  • Technology Research and Integration - Aided in the integration of new technologies, ensuring seamless compatibility with existing infrastructure.
  • Solution Testing and Stakeholder Engagement - Collaborated in adapting technical requirements and thoroughly testing new features.
  • Technical Debugging Support - Communicated effectively with CTOs and teams, supporting hardware component, BIOS, and firmware debugging.
  • Hardware and Storage Technology Expertise - Gained extensive experience with PCBs, storage technologies, networking, FPGA SSDs, and PCIe devices.
  • Storage Infrastructure Evolution - Contributed to transitioning from Fusion-io to standard NVMe and NVMe over Fabrics (NVMeoF).
  • Linux Filesystems Knowledge - Developed deep understanding of storage and Linux filesystems, focusing on system performance.
  • Team Growth and Agile Practices - Helped expand the hardware test engineering team, training engineers and a mid-level manager, implementing agile methodologies.
  • My role has been crucial in propelling the organization's cloud infrastructure, storage solutions, and team growth, achieving significant technological advancements.
Cloud Computing · High Performance Computing (HPC) · Test Automation · Hardware Testing · Python (Programming Language) · Cloud Infrastructure

Hardware Test Engineer

May 2017 - Feb 2020 · 2 yrs 9 mos

  • In my current role as a Hardware Test Engineer, I coordinate with computer hardware vendors to benchmark unreleased hardware and integrate new assets into the platform, enabling ServiceNow's SaaS applications to operate with extremely low latency.
  • Key Responsibilities:
  • Hardware Vendor Coordination: Work closely with hardware vendors to benchmark and evaluate unreleased hardware components.
  • Integration of New Assets: Integrate new hardware assets into the platform, enhancing ServiceNow's SaaS application performance.
  • Hardware Infrastructure Planning: Plan and manage the hardware infrastructure for all aspects of ServiceNow's platform at scale.
  • Server Hardware and Networking Tests: Conduct tests on server hardware and networking components for Quality of Service (QoS) and performance.
  • Configuration and Asset Tracking: Develop configuration items and asset tracking workflows to efficiently manage hardware resources.
  • Testing and QA Configurations: Create and manage testing and QA configurations for SSDs, PCIe, FPGAs, servers, and networked storage.
  • My role focuses on ensuring optimal performance and reliability of ServiceNow’s hardware infrastructure, contributing to the seamless operation of our SaaS offerings.
Hardware · High Performance Computing (HPC) · Linux · Hardware Testing · Bash · Python (Programming Language) · +1

Nexlevel Information Technology

System Administrator

Aug 2015 - May 2017 · 1 yr 9 mos · Sacramento, California Area · On-site

  • In my role as a System Administrator, I provided critical onsite, Tier 3 support for a Unix/Windows server environment, crucial for biometric capture and analysis operations serving over 300 remote clients. My duties included system configuration planning and development, optimizing IT resources, and ensuring compliance with Service Level Agreements (SLAs).
  • Key Responsibilities:
  • Tier 3 Support for Server Environment: Delivered advanced technical support for Unix and Windows servers, crucial for biometric capture and analysis capabilities.
  • Support for Remote Clients: Managed and maintained systems for over 300 remote clients, ensuring operational efficiency and reliability.
  • System Configuration Planning: Involved in planning and developing baseline system configurations to optimize server performance.
  • Cost Savings Optimization: Developed solutions to ensure all IT resources efficiently met SLA requirements, optimizing cost savings.
  • SLA Compliance and Recovery: Successfully recovered and restored state records from a malfunctioning networked storage array, maintaining SLA with the State of California and avoiding significant fines.
  • Monitoring Script Creation: Developed a script to monitor the uptime of all production assets under SLA, effectively minimizing downtime from malfunctioning services.
  • Collaboration with Development and Support Teams: Assisted Development and Enterprise Support Services Teams in verifying that backend infrastructure was configured according to business requirements.
  • In this role, I played a key part in ensuring system stability and efficiency, while also contributing to cost optimization and SLA compliance, thereby supporting the organization's operational success and client satisfaction.
Operating Systems · Hardware · System Administration · Linux · HP-UX · IT Support

United States Marine Corps

2 roles

Staff Meteorologist

Dec 2011 - Jun 2015 · 3 yrs 6 mos · North Carolina, United States · On-site

  • As a Staff Meteorologist, served as a Subject Matter Expert and instructor for Meteorological Information Technology at the Community College of the Air Force.
  • Taught graduate-level courses to multinational DoD and NATO students.
  • Integrated technological advancements into educational methods.
  • Career Highlights
  • Enhancement of Military Operational Readiness: Influenced strategic decision-making, enhancing readiness against environmental challenges.
  • Awards & Recognition
  • Navy and Marine Corps Achievement Medal: Recognized for exceptional performance in METOC, Marine Wing Support Squadron 374. Instrumental in exercises like Enhanced Mojave Viper and Weapons and Tactics Instructor courses.
  • Letter of Appreciation: Awarded for objective performance and commitment to excellence.
  • Additional Achievements
  • Subject Matter Expertise and Instruction: Specialized in teaching Meteorological Information Technology.
  • Virtualization of Classroom Environments: Implemented VMware thin clients for optimized classroom functionality.
  • Standard Operating Procedures Development: Authored comprehensive SOP manual, contributing to a 3% reduction in student attrition quarterly.
  • Extensive Curriculum Instruction: Delivered over 6,100 hours of education to 192 students, demonstrating instructional proficiency.
  • Course Revision and Improvement: Actively improved course content for enhanced educational quality.
Forecasting · Meteorology · Servers · Weather Radar · Training · Instructor-led Training · +1

Senior Meteorologist

May 2007 - Jun 2015 · 8 yrs 1 mo · North Carolina, United States · On-site

  • As a Senior Meteorologist in the Marine Corps, I managed meteorological systems and led training for junior Marines, involving IT infrastructure, network communications, and specialized meteorological analyses.
  • Key Responsibilities:
  • System Administration: Oversaw two modular data centers, handling software, environmental systems, and equipment requisition.
  • Leadership and Training: Trained and led 60 junior Marines, enhancing skills and operational readiness.
  • IT Infrastructure and Cloud Services: Configured IT systems for remote access to DoD cloud services and implemented a secure web portal for real-time data.
  • Remote Sensing Communications: Managed communications for remote sensing sites via WAN in Al-Anbar Province, Iraq.
  • Instruction and Expertise: Instructed multinational students in Meteorological Information Technologies for the US Department of Defense.
  • Radiosonde Observations and Forecasting: Conducted Radiosonde observations and 7-day forecasts in varied environments.
  • Climatology Presentations: Delivered climatology presentations internationally.
  • Impact Assessments: Created daily meteorological assessments for multiple assets.
  • Advanced Forecasting: Skilled in forecasting with or without computer models, essential in austere environments.
  • Technical Meteorological Skills: Proficient in Skew-T plotting and producing AIRMETS/SIGMETS.
  • My role involved applying a blend of technical knowledge and leadership to support meteorological operations and training, ensuring effective environmental assessment and support in diverse operational scenarios.

Education

Community College of the Air Force

Atmospheric Sciences and Meteorology

Jan 2007 - Jan 2012
