Spydra
Website: spydra.app
Job details:
Role Summary
Owns the inference serving stack as a performance engineer: Dynamo, vLLM, LLM guardrails, and the Envoy proxy chain that fronts the inference endpoint. Drives throughput, TTFT, and cost-per-token against SLO targets, and ships engine-level improvements, not just configuration changes.
What You'll Do
- Engineer and tune the Dynamo-based model deployment engine: backend selection, disaggregated prefill / decode, wide-EP, KV-cache routing.
- Tune the vLLM / SGLang runtime on AMD ROCm and NVIDIA GPUs: batch shapes, attention kernels, CUDA / HIP graphs, paged-attention block size, chunked prefill.
- Own quantisation policy (FP8 E4M3, INT4, AWQ / GPTQ) and the accuracy-vs-throughput trade-off per model family.
- Build the Envoy filter chain that fronts the inference endpoint: auth, rate-limit, request shaping, observability, retries, circuit-breaking.
- Integrate LLM guardrails (llm-guard / NeMo Guardrails / open-source equivalents) for prompt filtering, PII redaction, jailbreak / toxicity detection, and policy enforcement at the edge.
- Stand up and run the benchmark harness (AIPerf / locust-llm / custom): regression suites that gate every Dynamo / vLLM / guardrail release.
- Profile end-to-end with nsys / rocprof / pyroscope; identify and eliminate stalls in the serving path.
- Publish an SLO dashboard (TTFT p95, ITL p95, tokens / GPU-second, $/Mtok) and own it through launches.
Must Have
- Strong hands-on experience with at least one inference runtime in production: vLLM, SGLang, TensorRT-LLM, TGI, or Triton.
- Working knowledge of transformer internals: attention, KV cache, rotary embeddings, MoE routing, speculative decoding.
- GPU profiling and kernel-level debugging (nsys / nvprof / rocprof / hip-clang). Comfortable reading CUDA / HIP code.
- Envoy / service-mesh production experience: rate-limit service, ext_authz, Wasm filters.
- Python and C++; comfortable shipping patches upstream to vLLM / SGLang / Dynamo when needed.
- Solid DevOps fundamentals: containers, Helm, GitOps, CI/CD for model / engine releases.
Nice To Have
- Prior work on ai-dynamo / NVIDIA Dynamo internals.
- Experience with quantisation toolchains (AutoAWQ, GPTQ, TensorRT-LLM quantizers).
- Familiarity with LLM guardrail frameworks (llm-guard, NeMo Guardrails, Rebuff, ProtectAI).
- RCCL / NCCL / UCX collective tuning for tensor-parallel and expert-parallel workloads.
- Background building cost-per-token dashboards and capacity models.
(ref:hirist.tech)