Nucleus AI
Website:
withnucleus.ai
Job details:
At Nucleus, progress in AI is inseparable from the quality of the systems that manage compute at the hardware frontier. We’re hiring a Software Engineer, GPU & HPC Infrastructure to own GPU fleet management, scheduling, and HPC tooling that support training and inference at scale. This role is centered on one of the most consequential layers of the stack: the infrastructure that determines how high-performance compute is provisioned, shared, monitored, and used efficiently across the company. You will work on the systems that translate scarce and powerful hardware into a dependable, high-throughput platform for research and production workloads.
What you’ll do- Build and operate infrastructure for GPU fleet management across training and inference environments.
- Design and improve scheduling systems that allocate GPU and HPC resources efficiently across competing workloads.
- Develop tooling for cluster utilization, job execution, quota management, and performance visibility.
- Improve reliability and operational workflows for large-scale GPU environments, including maintenance, recovery, and upgrade paths.
- Partner with research and infrastructure teams to support distributed training, inference serving, and compute-intensive experimentation.
- Optimize resource usage, throughput, and cost efficiency across high-performance compute systems.
- Contribute to low-level infrastructure and automation around drivers, runtimes, system images, and workload readiness.
- Help define long-term architecture for scaling GPU and HPC platforms as Nucleus grows.
What we’re looking for- Strong experience working on GPU infrastructure, HPC systems, or performance-critical distributed platforms.
- Familiarity with workload schedulers, resource managers, or cluster orchestration systems used in high-performance computing.
- Experience operating Linux-based compute systems at scale, including debugging hardware- and system-level issues.
- Understanding of GPU workloads, distributed training environments, and the practical constraints of shared compute platforms.
- Strong programming skills in Go, Python, Rust, C++, or similar systems-oriented languages.
- Experience with observability, performance tuning, and automation in compute-heavy environments.
- A thoughtful approach to balancing utilization, fairness, reliability, and developer ergonomics.
- Excitement about enabling frontier AI through better hardware and infrastructure systems.
Why NucleusAt Nucleus, GPU and HPC infrastructure is not a support function on the sidelines. It is part of how research happens, how models improve, and how production systems perform under real demand.
In this role, you’ll help shape the platform that turns high-performance compute into meaningful capability. Your work will influence how effectively we train, serve, and scale the systems at the center of our mission. If you’re energized by hard infrastructure problems at the intersection of hardware, systems, and AI, we’d love to hear from you.
Click on Apply to know more.