LLM Reliability & Evaluation Engineer

XenonStack

Location: Sahibzada Ajit Singh Nagar, Punjab, India
Job type: Full-time

Required skills

LangChain
Python
business objectives
compliance
NLP

About the role

Website: xenonstack.com
Job details:
ABOUT XENONSTACK

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:
Agentic Systems for AI Agents → akira.ai
Vision AI Platform → xenonstack.ai
Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world’s transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.

THE OPPORTUNITY

We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness. This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, and on building frameworks for reliability, robustness, and continuous improvement.
If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.

KEY RESPONSIBILITIES

Evaluation Frameworks
Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.
Develop automated systems for benchmarking models on enterprise-relevant tasks.

Reliability Engineering
Conduct stress tests, adversarial testing, and edge-case evaluations.
Build tools to measure latency, consistency, and error recovery in multi-turn interactions.

Metrics & Monitoring
Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.
Establish real-time monitoring for drift, anomalies, and performance regressions.

Collaboration & Alignment
Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.
Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.

Continuous Improvement
Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.
Maintain a central repository of test cases, benchmarks, and evaluation results.

Research & Innovation
Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.
Explore automated evaluation using agentic test harnesses and synthetic data generation.
SKILLS & QUALIFICATIONS

Must-Have
3–6 years of experience in AI/ML, NLP, or applied model evaluation.
Strong understanding of LLM architectures, prompt engineering, and failure modes.
Hands-on experience with evaluation frameworks (eval harnesses, Ragas, OpenAI Evals, DeepEval).
Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, and Hugging Face.
Experience with vector databases, RAG pipelines, and knowledge graph integration.
Familiarity with bias/fairness testing and Responsible AI frameworks.

Good-to-Have
Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.
Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).
Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.
Contributions to open-source evaluation libraries or research papers.
WHY SHOULD YOU JOIN US?

Agentic AI Product Company
Ensure reliability in cutting-edge AI platforms that are redefining enterprise adoption.

A Fast-Growing Category Leader
Be part of one of the fastest-growing AI Foundries, powering Fortune 500 enterprises with trustworthy AI.

Career Mobility & Growth
Grow into roles such as AI Systems Architect, Responsible AI Engineer, or Reliability Engineering Lead.

Global Exposure
Work on enterprise-scale evaluation challenges across BFSI, Healthcare, Telecom, and GRC.

Create Real Impact
Your evaluations will directly shape production-grade AI agents used in mission-critical systems.

Culture of Excellence
Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) empower you to innovate fearlessly.

Responsible AI First
Join a company that prioritizes trustworthy, explainable, and compliant AI.
XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values
Agency – Be self-directed and proactive.
Taste – Sweat the details and build with precision.
Ownership – Take responsibility for outcomes.
Mastery – Commit to continuous learning and growth.
Impatience – Move fast and embrace progress.
Customer Obsession – Always put the customer first.

Our Product Philosophy
Obsessed with Adoption – Making AI accessible, reliable, and enterprise-ready.
Obsessed with Simplicity – Turning complex evaluation challenges into seamless, automated frameworks.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making AI agents not just powerful, but trustworthy and reliable.

Click on Apply to know more.
