Associate Vice President - Senior Lead MLOps Engineer [T500-25991]

Deutsche Börse

Location: Hyderabad, Telangana, India
Job type: Full-time

Required skills

Python
capital markets
change management
continuous integration
compliance
DevOps
incident response
Kubeflow
machine learning
regression
Root Cause Analysis
version control
Vertex AI

About the role

Website: deutsche-boerse.com
Job details:

About Deutsche Börse Group:

Headquartered in Frankfurt, Germany, Deutsche Börse Group is a leading international exchange organization and market infrastructure provider. They empower investors, financial institutions, and companies by facilitating access to global capital markets.

Their India centre, located in Hyderabad, serves as a key strategic hub and comprises India’s top-tier tech talent. They focus on crafting advanced IT solutions that elevate market infrastructure and services. Deutsche Börse Group in India is composed of a team of capital market engineers forming the backbone of financial markets worldwide.

Job description:

DBI - Clearstream Post Trade IT

Deutsche Börse Group is currently seeking an MLOps Engineer – AI Run & Reliability (f/m/d).

Your area of work:

As an integral part of the Deutsche Börse Group, Clearstream offers settlement and custody services to more than 2,500 clients worldwide, covering over 300,000 domestic and internationally traded bonds and equities. Clearstream's core business ensures that cash and securities are promptly and effectively delivered between parties, and that clients are always notified of the rights and obligations attached to the securities they keep under our custody. Thanks to its committed staff, Deutsche Börse Group has developed into one of the most modern exchange organisations in the world. More than 10,000 employees work for the Group, forming a dynamic, highly motivated and international team.

Your responsibilities:

Own day-to-day operations for multiple AI services (LLM apps, ML models, RAG services, agents), meeting SLO/SLA targets for availability, latency, cost, and quality.
Maintain operational documentation (runbooks, escalation matrices) and participate in the on-call schedule.
Coordinate incident response including triage, mitigation, and root cause analysis.
Manage releases and deployments (including canary deployments and blue/green rollouts) and make sure test and production environments remain aligned.
Follow the company’s formal change management process and maintain required documentation for audits.
Design and maintain full observability across AI systems, covering both service health metrics and model quality indicators.
Track and act on KPIs such as latency, throughput, error rates, safety filter triggers, factual accuracy issues, user acceptance rates, and operational cost per 1K tokens.
Monitor data and model drift and take action when behaviour changes.
Build dashboards and alerts; adjust thresholds based on trends.
Run A/B tests and use controlled side-by-side evaluations to validate improvements.
Operate and improve continuous integration and continuous delivery workflows for machine learning and large language model components, including pipelines, model registries, and version control.
Trigger retraining or prompt updates when quality drops, new data becomes available, or policies change. Coordinate any required data labeling tasks.
Manage rollbacks and controlled promotions between models to ensure reliability and auditability.
Maintain full traceability for decisions, model versions, and deployments.
Ensure compliance with data protection rules, regional data storage requirements, and secure secret handling.
Integrate AI-safety measures such as content filtering and misuse detection.
Maintain documentation describing how each model works, how it was trained, and any risks or limitations.
Collaborate with Security, Risk, and Governance teams to provide required evidence for audits and reviews.
Work closely with testing teams on release preparation, environment planning, and resilience testing.
Contribute AI-specific checks such as prompt quality regression tests, bias/safety evaluations, and retrieval quality assessments.
Collaborate with data engineering, application engineering, and AI-engineering teams to optimize operations and control costs.
Help build shared components, templates, and best practices that other AI teams can reuse.
Ensure all operational and risk-related documentation is kept up to date.

Your profile:

Experience in Site Reliability Engineering, DevOps, Test Operations, Platform Engineering, or MLOps, ideally with responsibility for production systems.
Hands-on experience with container orchestration platforms such as OpenShift or Kubernetes; CI/CD tools; and observability solutions (metrics, logs, traces, dashboards).
Practical experience operating machine learning or large language model workloads: serving models, running pipelines, maintaining model registries, and evaluating performance.
Strong understanding of evaluating ML/LLM quality and safety, detecting drift, running A/B tests, and managing controlled deployments.
Proficiency in Python and familiarity with at least one ML platform (MLflow, Vertex AI, Kubeflow, etc.).
Ability to communicate clearly and produce high-quality documentation (runbooks, change requests, incident reports).
Experience with retrieval-based AI systems (vector databases, retrieval pipelines), prompt engineering, or multi-step AI agents.
Background in AI security or risk assessments (safety testing, privacy, auditability).
Knowledge of enterprise change management processes and business continuity testing.
