Spydra - Performance Engineer

Spydra

Location: Bengaluru, Karnataka, India
Job type: Full-time

Required skills

Python
backend
CUDA
DevOps
end-to-end
Helm
proxy
regression

About the role

Website: spydra.app
Job details:
Role Summary

Owns the inference serving stack as a performance engineer: Dynamo, vLLM, LLM guardrails, and the Envoy proxy chain that fronts the inference endpoint. Drives throughput, TTFT, and cost-per-token against SLO targets, and ships engine-level improvements, not just configuration changes.

What You'll Do

Engineer and tune the Dynamo-based model deployment engine: backend selection, disaggregated prefill / decode, wide-EP, KV-cache routing.
Tune the vLLM / SGLang runtime on AMD ROCm and NVIDIA GPUs: batch shapes, attention kernels, CUDA / HIP graphs, paged-attention block size, chunked prefill.
Own quantisation policy (FP8 E4M3, INT4, AWQ / GPTQ) and the accuracy-vs-throughput trade-off per model family.
Build the Envoy filter chain that fronts the inference endpoint: auth, rate-limit, request shaping, observability, retries, circuit-breaking.
Integrate LLM guardrails (llm-guard / NeMo Guardrails / open-source equivalents) for prompt filtering, PII redaction, jailbreak / toxicity detection, and policy enforcement at the edge.
Stand up and run the benchmark harness (AIPerf / locust-llm / custom), with regression suites that gate every Dynamo / vLLM / guardrail release.
Profile end-to-end with nsys / rocprof / pyroscope; identify and eliminate stalls in the serving path.
Publish an SLO dashboard (TTFT p95, ITL p95, tokens / GPU-second, $/Mtok) and own it through launches.

Must Have

Strong hands-on experience with at least one inference runtime (vLLM, SGLang, TensorRT-LLM, TGI, or Triton) in production.
Working knowledge of transformer internals: attention, KV cache, rotary embeddings, MoE routing, speculative decoding.
GPU profiling and kernel-level debugging (nsys / nvprof / rocprof / hip-clang). Comfortable reading CUDA / HIP code.
Envoy / service-mesh production experience: rate-limit service, ext_authz, Wasm filters.
Python and C++; comfortable shipping patches upstream to vLLM / SGLang / Dynamo when needed.
Solid DevOps fundamentals: containers, Helm, GitOps, CI/CD for model / engine releases.

Nice To Have

Prior work on ai-dynamo / NVIDIA Dynamo internals.
Experience with quantisation toolchains (AutoAWQ, GPTQ, TensorRT-LLM quantizers).
Familiarity with LLM guardrail frameworks (llm-guard, NeMo Guardrails, Rebuff, ProtectAI).
RCCL / NCCL / UCX collective tuning for tensor-parallel and expert-parallel workloads.
Background building cost-per-token dashboards and capacity models.

(ref:hirist.tech)