CAW
Website:
cawstudios.com
Job details:
Senior QA Engineer — AI Systems
Team: KnackLabs Engineering
Location: Hyderabad
Experience: 2–6 years
The role
AI systems break differently from traditional software. A regression isn't a stack trace; it's a 7% drop in answer relevance that you spot three weeks late.
We need a QA engineer who treats this as an engineering problem: builds eval datasets before features ship, instruments traces before bugs surface, and reads OpenAPI specs before opening a UI.
You'll own quality across two production AI systems:
- Voice agents for regulated industries: Hindi-English code-switched conversations for banking, insurance, and similar use cases.
- Enterprise AI chat platform: agent orchestration, retrieval, and PII handling for regulated enterprise customers.
Why this work
- Greenfield eval infrastructure for production AI, not maintaining legacy test suites for a mature product.
- Real stakes. Regulated industries, real customers, real money flows. Hallucinations are not allowed.
- Embedded in design from day one: eval scores in PR descriptions before merge, not a downstream gate.
- Modern stack. Anthropic SDK in TypeScript, coding agents (Codex, Claude Code) in the loop, eval tooling actively evolving.
What you'll do
- Build eval datasets and harnesses: golden sets, regression suites, LLM-as-judge harnesses (verified against human labels).
- Design observability so anyone can answer "did the agent get worse this week?" in under 10 minutes, with a chart, not vibes.
- Gate every prompt PR on automated eval scores, embedded in design rather than bolted on as a downstream gate.
- Partner with AI engineering on red-teaming: adversarial datasets for PII, jailbreaks, and prompt injection.
- Run load and chaos tests on async LLM pipelines.
Success picture: by month 6, drift dashboards are live, the regression suite covers the top conversational intents, and AI engineers run their changes through your harness without being asked.
What we're looking for
- Systems thinking over screen thinking. You reason about contracts, retries, latency, streaming, and async behaviour, not just what's on the page.
- Eval-first instinct. Asked to test a chatbot, you reach for a golden dataset, not Selenium.
- You write code. Not glue scripts, but code that survives a senior engineer's review.
- You debug from telemetry. You've found the root cause from logs and traces.
- You've killed a flaky test and have an opinion on why most flaky tests are actually bad tests.
- You work alongside coding agents (Codex, Claude Code) and review their output as critically as you would a human's.
On paper
- 2–6 years in QA / SDET / SET / Quality Engineering, with at least 1.5–2 years on backend / API / systems testing.
- Strong fluency in any modern language: TypeScript, Java, Go, or Python. Language is not a barrier.
- Modern test framework with non-trivial fixtures and plugins: Vitest / Jest / Pytest / JUnit + RestAssured / equivalent.
- One of: contract testing (Pact / Postman / Schemathesis), load testing (k6 / Locust / JMeter), distributed tracing (OpenTelemetry / Datadog / Honeycomb), CI test infrastructure.
- Comfort reading backend code in PRs and using test management tooling (JIRA / Zephyr / TestRail).
Bonus (genuinely bonus, not silent rejects)
- Hands-on with LLM / RAG / agent / voice systems
- Eval tooling: Langfuse, LangSmith, Phoenix, Braintrust, Ragas, DeepEval, OpenAI Evals
- Voice/telephony testing: call quality, latency, ASR/TTS evaluation
- Regulated-domain QA: PII, audit trails, compliance gates
- Hindi or other Indic language testing
- Open-source contributions in test or eval tooling
Stack we use today
- AI integration: Anthropic SDK in TypeScript, embedded in our existing application
- Eval/observability: Langfuse, LangSmith, OpenTelemetry, plus internal harnesses
- Languages: TypeScript preferred for AI app code; Python, Go, or Java elsewhere as the problem demands
- Coding assistants: Codex and Claude Code are part of normal development
We hire on primitives: eval rigour, observability, contract literacy, failure-mode imagination. Tools turn over; primitives don't.
Click on Apply to know more.