AI Testing – LLM Evaluation Engineer

Live Connections

full-time

Required skills

Python
token accounting
AWS
Azure
GCP
JSON
microservices
TypeScript
version control
Vertex

About the role

Live Connections

Website: liveconnections.in
Job details:

Experience: 4-10 yrs

CTC: Up to 18 LPA

Key Responsibilities

1) AI/LLM Evaluation & Test Design

· Define evaluation strategies (golden sets, adversarial suites, regressions), pass/fail gates, and SLOs for quality, safety, latency, and cost.

· Establish rubric-based human reviews (usefulness, faithfulness, safety, clarity) and calibrate annotators.

· Instrument LLM-as-judge where appropriate with calibration and spot checks.

2) RAG, Retrieval, & Grounding

· Measure retrieval precision/recall, MRR/nDCG, and answer faithfulness to sources; detect hallucination and citation errors.

· Test chunking, prompt templates, filters, and policy chains; monitor stale/poisoned content.

3) Agentic & Tool-Use Scenarios

· Validate multi-step plans, tool selection, error recovery, retries, and idempotency for functions with side effects.

· Contract-test JSON schemas and structured outputs across services.

4) Team & Standards

· Adopt and improve test standards/methodology; share practices, train teams, participate in peer reviews, and pursue self-directed learning.

Required Qualifications

· 4-6 years of software development experience, with working proficiency in a programming language (e.g., Python/TypeScript) and experience with CI/CD and version control.

· Cloud basics (AWS/Azure/GCP) and microservices fundamentals.

· Degree/Diploma in CS/IT or equivalent.

Required (AI/ML Focus)

· Understanding of ML concepts and MLOps; experience with model validation and monitoring in production.

· Familiarity with evaluation/observability tools (any of): LangSmith, Weights & Biases, RAGAS, TruLens, Promptfoo, DeepEval, Guardrails/Llama Guard, Presidio; plus OpenTelemetry-style LLM traces.

· Practical exposure to Azure OpenAI/Amazon Bedrock/Vertex AI and model gateways; quota and token accounting know-how.

Tooling & Automation (Preferred)

· Data evaluation pipelines for RAG (embedding validation, filtering, drift detection).

Traits

· Outcome-oriented, high standards; strong communication and collaboration; customer-focused; proficient in written and spoken English.

Telco Context (Nice-to-Have)

· Experience testing copilots/agents for BSS/OSS, NOC analytics, and enterprise care; ability to tie eval KPIs to CSAT, AHT, FCR, MTTR.
