SDET 2
CAW
Full-time

Required skills: Python, API, backend, Chart, compliance, Datadog, Gate, Java, Jest, Jira, JUnit, regression, Selenium, test infrastructure, TypeScript

About the role
CAW Website: cawstudios.com

Job details

Responsibilities
Build evaluation datasets and harnesses: golden sets, regression suites, and LLM-as-judge harnesses (verified against human labels).
Design observability so anyone can answer "Did the agent get worse this week?" in under 10 minutes, with a chart, not vibes.
Gate every prompt PR on automated eval scores, with evals embedded in design rather than added as a downstream gate.
Partner with AI engineering on red-teaming adversarial datasets for PII, jailbreaks, and prompt injection.
Run load and chaos tests on async LLM pipelines.

Requirements
2-5 years in QA / SDET / SET / quality engineering, with at least 1-1.5 years on backend / API / systems testing.
Strong fluency in any modern language: TypeScript, Java, Go, or Python. Language is not a barrier.
Modern test framework with non-trivial fixtures and plugins: Vitest / Jest / Pytest / JUnit + RestAssured / equivalent.
One of the following: contract testing (Pact / Postman / Schemathesis), load testing (k6 / Locust / JMeter), distributed tracing (OpenTelemetry / Datadog / Honeycomb), or CI test infrastructure.
Comfort reading backend code in PRs and using test management tooling (JIRA / Zephyr / TestRail).

Bonus (genuinely a bonus, not silent rejects)
Hands-on with LLM / RAG / agent / voice systems.
Eval tooling: Langfuse, LangSmith, Phoenix, Braintrust, Ragas, DeepEval, OpenAI Evals.
Voice/telephony testing: call quality, latency, ASR/TTS evaluation.
Regulated-domain QA: PII, audit trails, compliance gates.
Hindi or other Indic language testing.
Open-source contributions in test or eval tooling.

Stack we use today
AI integration: Anthropic SDK in TypeScript, embedded in our existing application.
Eval/observability: Langfuse, LangSmith, OpenTelemetry, plus internal harnesses.
Languages: TypeScript preferred for AI app code; Python, Go, or Java elsewhere as the problem demands.
Coding assistants: Codex and Claude Code are part of normal development.

We hire on primitives: evaluation rigour, observability, contract literacy, and failure-mode imagination. Tools turn over; primitives don't.

What We're Looking For
Systems thinking over screen thinking. You reason about contracts, retries, latency, streaming, and async, not just what's on the page.
Eval-first instinct. Asked to test a chatbot, you reach for a golden dataset, not Selenium.
You write code. Not glue scripts, but code that survives a senior engineer's review.
You debug from telemetry. You've found the root cause from logs and traces. You've killed a flaky test and have an opinion on why most flaky tests are actually bad tests.
You work alongside coding agents (Codex, Claude Code) and review their output as critically as a human would.

This job was posted by S M Nandakishore from CAW Studios.