SDET 2

CAW

Location: Hyderabad, Telangana, India
Job type: Full-time

Required skills

Python
API
backend
Chart
compliance
Datadog
Gate
Java
Jest
Jira
JUnit
regression
Selenium
test infrastructure
TypeScript

About the role

Website: cawstudios.com
Job details:
Responsibilities

Build evaluation datasets and harnesses - golden sets, regression suites, and LLM-as-judge harnesses (verified against human labels).
Design observability so anyone can answer, "Did the agent get worse this week?" in under 10 minutes, with a chart, not vibes.
Gate every prompt PR on automated eval scores, with quality embedded in design rather than added as a downstream gate.
Partner with AI engineering on red-teaming adversarial datasets for PII, jailbreaks, and prompt injection.
Run load and chaos tests on async LLM pipelines.
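
To make the eval-gating responsibility above concrete, here is a minimal sketch of scoring an agent against a golden set and blocking a change that regresses past a tolerance. It is an illustration only, not CAW's internal harness: GoldenCase, exact_match, eval_score, gate, the 0.02 tolerance, and the stub agent are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One hand-labelled example: a prompt and the answer a human accepted."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def eval_score(
    agent: Callable[[str], str],
    cases: Sequence[GoldenCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Mean score of the agent over the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(scorer(agent(c.prompt), c.expected) for c in cases) / len(cases)


def gate(candidate: float, baseline: float, max_regression: float = 0.02) -> bool:
    """Pass only if the candidate is no more than max_regression below baseline."""
    return candidate >= baseline - max_regression


# Assumed toy data: a real golden set would be versioned alongside the prompts.
GOLDEN = [
    GoldenCase("2+2?", "4"),
    GoldenCase("Capital of India?", "New Delhi"),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    answers = {"2+2?": "4", "Capital of India?": "new delhi "}
    return answers.get(prompt, "")


candidate_score = eval_score(stub_agent, GOLDEN)
passed = gate(candidate_score, baseline=0.95)
```

In CI, a job along these lines would run on every prompt PR and fail the build when the gate returns False. An LLM-as-judge scorer could replace exact_match once its verdicts have been checked against human labels.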

Requirements

2-5 years in QA / SDET / SET / quality engineering, with at least 1-1.5 years on backend / API / systems testing.
Strong fluency in any modern language - TypeScript, Java, Go, or Python. Language is not a barrier.
Modern test framework with non-trivial fixtures and plugins: Vitest / Jest / Pytest / JUnit + RestAssured / equivalent.
One of the following: contract testing (Pact / Postman / Schemathesis), load testing (k6 / Locust / JMeter), distributed tracing (OpenTelemetry / Datadog / Honeycomb), or CI test infrastructure.
Comfort reading backend code in PRs and using test management tooling (Jira / Zephyr / TestRail).

Bonus (genuinely a bonus, not silent rejects)

Hands-on with LLM / RAG / agent / voice systems.
Eval tooling: Langfuse, LangSmith, Phoenix, Braintrust, Ragas, DeepEval, OpenAI Evals.
Voice/telephony testing - call quality, latency, ASR/TTS evaluation.
Regulated-domain QA - PII, audit trails, compliance gates.
Hindi or other Indic language testing.
Open-source contributions in test or eval tooling.

Stack we use today

AI integration: Anthropic SDK in TypeScript, embedded in our existing application.
Eval/observability: Langfuse, LangSmith, OpenTelemetry, plus internal harnesses.
Languages: TypeScript preferred for AI app code; Python, Go, or Java elsewhere as the problem demands.
Coding assistants: Codex and Claude Code are part of normal development.

We hire on primitives - evaluation rigour, observability, contract literacy, and failure-mode imagination. Tools turn over; primitives don't.

What We're Looking For

Systems thinking over screen thinking. You reason about contracts, retries, latency, streaming, and async, not just what's on the page.
Eval-first instinct. Asked to test a chatbot, you reach for a golden dataset, not Selenium.
You write code. Not glue scripts, but code that survives a senior engineer's review.
You debug from telemetry. You've found the root cause from logs and traces. You've killed a flaky test and have an opinion on why most flaky tests are actually bad tests.
You work alongside coding agents (Codex, Claude Code) and review their output as critically as a human would.

This job was posted by S M Nandakishore from CAW Studios.
