XenonStack
Website: xenonstack.com
Job details:
ABOUT XENONSTACK

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:
- Agentic Systems for AI Agents → akira.ai
- Vision AI Platform → xenonstack.ai
- Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world’s transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.

THE OPPORTUNITY

We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness. This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, building frameworks for reliability, robustness, and continuous improvement. If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.

KEY RESPONSIBILITIES

Evaluation Frameworks
- Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.
- Develop automated systems for benchmarking models on enterprise-relevant tasks.

Reliability Engineering
- Conduct stress tests, adversarial testing, and edge-case evaluations.
- Build tools to measure latency, consistency, and error recovery in multi-turn interactions.

Metrics & Monitoring
- Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.
- Establish real-time monitoring for drift, anomalies, and performance regressions.

Collaboration & Alignment
- Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.
- Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.

Continuous Improvement
- Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.
- Maintain a central repository of test cases, benchmarks, and evaluation results.

Research & Innovation
- Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.
- Explore automated evaluation using agentic test harnesses and synthetic data generation.

SKILLS & QUALIFICATIONS

Must-Have
- 3–6 years of experience in AI/ML, NLP, or applied model evaluation.
- Strong understanding of LLM architectures, prompt engineering, and failure modes.
- Hands-on experience with evaluation frameworks (Eval harnesses, Ragas, OpenAI Evals, DeepEval).
- Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, and Hugging Face.
- Experience with vector databases, RAG pipelines, and knowledge graph integration.
- Familiarity with bias/fairness testing and Responsible AI frameworks.

Good-to-Have
- Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.
- Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).
- Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.
- Contributions to open-source evaluation libraries or research papers.

WHY SHOULD YOU JOIN US?

- Agentic AI Product Company: Ensure reliability in cutting-edge AI platforms that are redefining enterprise adoption.
- A Fast-Growing Category Leader: Be part of one of the fastest-growing AI Foundries, powering Fortune 500 enterprises with trustworthy AI.
- Career Mobility & Growth: Grow into roles such as AI Systems Architect, Responsible AI Engineer, or Reliability Engineering Lead.
- Global Exposure: Work on enterprise-scale evaluation challenges across BFSI, Healthcare, Telecom, and GRC.
- Create Real Impact: Your evaluations will directly shape production-grade AI agents used in mission-critical systems.
- Culture of Excellence: Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) empower you to innovate fearlessly.
- Responsible AI First: Join a company that prioritizes trustworthy, explainable, and compliant AI.

XENONSTACK CULTURE: JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values
- Agency: Be self-directed and proactive.
- Taste: Sweat the details and build with precision.
- Ownership: Take responsibility for outcomes.
- Mastery: Commit to continuous learning and growth.
- Impatience: Move fast and embrace progress.
- Customer Obsession: Always put the customer first.

Our Product Philosophy
- Obsessed with Adoption: Making AI accessible, reliable, and enterprise-ready.
- Obsessed with Simplicity: Turning complex evaluation challenges into seamless, automated frameworks.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making AI agents not just powerful, but trustworthy and reliable.