Job details:
Sr. Inference Engineer (LLM)
Location: Gurugram, India | Experience: 3–5+ years | Type: Full-time
About PrimaLabs
PrimaLabs builds self-driving AI infrastructure that automatically optimizes how machine learning workloads run on modern hardware. Our platform eliminates manual tuning by exploring billions of configurations across hardware, software, and model settings. By continuously learning from each deployment, PrimaLabs delivers systems that are self-aware, self-optimizing, and self-improving—enabling organizations to reduce costs, accelerate deployment, and maximize performance.
The Role
We're looking for a Senior Inference Engineer to join our core systems team in Gurugram. You'll work at the intersection of ML, compilers, and accelerator hardware, building the runtime, kernels, and optimization layers that make large models run faster and cheaper at scale. Your work will directly shape the autonomous optimization engine that learns from every deployment and ships those wins back to customers.
This is a high-ownership role. You'll profile production workloads, find the bottlenecks others miss, and rewrite the hot paths—whether that's a fused attention kernel, a smarter KV-cache layout, or an entirely new scheduling strategy.
What You'll Do
• Optimize inference for LLMs and other large models across NVIDIA GPUs (and emerging accelerators), improving throughput, latency, and cost per token.
• Write, profile, and tune high-performance kernels using CUDA, Triton, and modern compiler stacks (TVM, MLIR, XLA, or similar).
• Build and extend serving infrastructure: batching strategies, KV-cache management, speculative decoding, quantization, and tensor/pipeline parallelism.
• Integrate with and contribute to inference engines such as vLLM, TensorRT-LLM, SGLang, or TGI.
• Design experiments at scale to evaluate optimization strategies and feed results back into our autonomous tuning loop.
• Partner with research and platform teams to take new model architectures from prototype to production deployment.
• Own performance regressions end-to-end, from reproduction and root-cause analysis to the fix that ships.
What We're Looking For
• 3–5+ years of experience in ML systems, high-performance computing, compilers, or GPU programming.
• Strong programming skills in Python and C++; comfort reading and writing low-level code.
• Hands-on experience with GPU programming (CUDA, Triton, or similar) and an intuition for what makes kernels fast.
• Solid understanding of transformer architectures and modern inference techniques: continuous batching, paged attention, quantization (INT8/FP8/INT4), and speculative decoding.
• Experience profiling and optimizing real workloads using tools like Nsight, nvprof, or PyTorch Profiler.
• Familiarity with at least one production inference framework (vLLM, TensorRT-LLM, SGLang, TGI, or comparable).
• A bias toward measurement: you don't guess, you benchmark.
Bonus Points
• Contributions to open-source ML systems projects (PyTorch, vLLM, Triton, MLIR, etc.).
• Experience with compiler internals (MLIR, LLVM, TVM, XLA) or building custom passes.
• Background in distributed inference: tensor parallelism, pipeline parallelism, or disaggregated serving.
• Exposure to non-NVIDIA accelerators (AMD, AWS Trainium/Inferentia, Google TPU, Intel Gaudi).
• Experience with autotuning frameworks or applying ML to systems problems.