Senior Machine Learning Engineer

EXL

full-time

Required skills

LangChain
Python
AWS
Apache Airflow
Azure
data ingestion
DevOps
Docker
end-to-end
GitHub
GPU
Kubeflow
Kubernetes
TensorFlow
PyTorch
Vertex AI

Website: exlservice.com

About the Role

We are looking for an experienced LLM Ops Engineer to own the end-to-end lifecycle of LLM applications in production - from model selection and pipeline design through fine-tuning, deployment, observability, and continuous improvement. This role sits at the intersection of ML Engineering, DevOps, and Data Engineering, and is critical to ensuring that GenAI systems are reliable, cost-efficient, and scalable in enterprise environments. You will partner closely with AI Research, Product, Platform, and Data Engineering teams.

Responsibilities

Design, build, and maintain end-to-end LLM pipelines - from data ingestion and pre-processing through model training, fine-tuning, and deployment into production.
Implement and manage CI/CD pipelines for ML/LLM workflows using tools such as MLflow, Kubeflow, and GitHub Actions, ensuring reproducibility and fast iteration cycles.
Own model lifecycle management: versioning, A/B testing, canary deployments, rollbacks, and governance - ensuring models are always production-safe.
Architect and operate LLM serving infrastructure on cloud or on-premises with high availability, low latency, and cost efficiency.
Build robust monitoring, observability, and alerting frameworks for model drift, hallucinations, latency, token costs, and quality regressions, using tools such as LangSmith and Weights & Biases.
Build RAG pipelines backed by vector databases, and drive model fine-tuning initiatives for domain-specific applications.
Establish and enforce LLMOps best practices including prompt versioning, evaluation frameworks, guardrails, PII policies, and audit trails.
Manage the AI gateway and model routing across multiple LLM providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Vertex AI) with unified auth, rate limiting, and fallback logic.
Optimise inference costs through quantisation, batching strategies, hardware (GPU/TPU) optimisation, and model compression.
Mentor junior engineers and contribute to internal documentation and platform tooling.

Qualifications

B.Tech / M.Tech in CS, AI/ML, Mathematics or equivalent.

Required Skills

Languages: Python (advanced)
Frameworks: LangChain, LangGraph, Hugging Face, PyTorch, TensorFlow
MLOps / Pipeline Tools: MLflow, Kubeflow, Apache Airflow, Prefect
DevOps / Infra: Docker, Kubernetes, GitHub Actions
Cloud Platforms: AWS Bedrock, Azure OpenAI, Google Vertex AI

Preferred Skills

Experience with RAG and vector databases, fine-tuning (LoRA, PEFT), LLM observability (LangSmith, Weights & Biases, or similar), and prompt evaluation.
Good to have: security governance (LLM red-teaming, PII redaction, AI safety guardrails) and streaming (event-driven architecture).

Experience

6 – 10+ years overall in software / ML engineering

3+ years hands-on with the production LLM/ML lifecycle

Equal Opportunity Statement

We are committed to diversity and inclusivity.
