About the role
We're hiring an ML Optimization Engineer to make sure our algorithms run at peak performance on our current hardware (NVIDIA Jetson) and to help shape how we deploy on future platforms. You'll be our CUDA expert, optimizing our models for real-time performance and squeezing every bit of efficiency out of the hardware. You know how to make the most of a DLA core and how to extract every bit of performance from a warp: maximizing occupancy, minimizing numeric issues, and maintaining an efficient, consistent power profile. You understand the nuances of thread-block scheduling and can use that knowledge to decide whether a tool like TensorRT is the right fit or whether we need to scrap it altogether.
This role is all about high-performance implementation. You'll work closely with research and engineering teams, taking cutting-edge ML concepts and making them run faster, smoother, and smarter on our embedded hardware. This starts with our CUDA-based platforms but extends to other edge accelerators, such as FPGAs. Your insights will directly influence not just our deployment strategies but also the direction of our research.
What you'll do:
Optimize machine learning algorithms for efficient execution on Jetson and other embedded platforms.
Work with CUDA to fine-tune performance at the hardware level.
Collaborate with ML researchers, providing feedback on model architectures and suggesting changes that better align with hardware constraints.
Work closely with our embedded software team to stay ahead of hardware advancements, ensuring our deployment strategies are future-proof.
Own the high-performance implementation of our ML models, ensuring they work seamlessly within the constraints of embedded environments.
Grow fast with real opportunities – We'll keep expanding your scope and giving you bigger challenges to help you reach your goals. If you don't know exactly what role you want to grow into, you'll have the freedom to take on different responsibilities and find the right path.
Who we're looking for (every role):
Fast learners over specific backgrounds – We care more about how quickly you can pick up new skills than where you've worked before.
Intellectual honesty – The right answer matters more than being right. You challenge assumptions, test ideas, and pivot when needed.
Adaptability – We're organized, but sometimes things change quickly. You find a way to make it work and balance short-term deliverables with long-term goals.
Ownership of outcomes – You optimize your own time, focus on what matters to deliver quickly, and cut out inefficiencies.
Not building in a vacuum – You stay connected to the rest of our teams and our customers to make sure all the pieces fit together.
Who we're looking for (this role):
CUDA expert with deep experience optimizing ML workloads for embedded or edge hardware.
Strong understanding of TensorRT – ideally, you know its limitations inside and out and have ideas for something better.
Proficient in low-level optimizations and frameworks that push ML models to their peak efficiency.
Comfortable navigating the intersection of ML theory and real-world implementation – you understand the math but care most about making it run fast.
Experience working with Jetson or similar platforms.
Excited to take on new and challenging accelerators, including guiding model development very close to the metal.
A natural problem solver who thrives on squeezing performance out of complex systems.