Principal AI/ML Engineer

Sabre Corporation

Location: Bengaluru, Karnataka, India
Job type: Full-time

Required skills

Python
Apigee
BigQuery
caching
capacity planning
compliance
cross-functional
data ingestion
Dataflow
end-to-end
GCP
Google Cloud
infrastructure-as-code
multi-tenant
SRE
Terraform
TypeScript
user feedback
Vertex

About the role

Website: sabre.com
Job details:
The Principal AI/ML Engineer is the technical leader responsible for designing, building, and scaling AI systems that combine LLM-powered GenAI and ADK-based agentic workflows on Google Cloud Platform. This role sets architecture standards, leads multi-team delivery, builds and manages the platform, and governs safety, reliability, and cost at enterprise scale, accelerating product teams toward 10× productivity through reusable patterns, platforms, and guardrails.

Key Responsibilities

Strategy & Architecture

Define reference architectures for GenAI apps, RAG systems, and agent ecosystems (single/multi-agent) on GCP using ADK.
Establish domain and platform standards: model selection, RAG/generation patterns, memory architectures, security baselines, observability, and LLMOps.
Lead portfolio-wide technical decisions (build/buy, vendor selection, SLAs, quotas) with a focus on reliability, safety, and cost control.

Solution Design & Delivery

Architect and lead implementation of production-grade GenAI solutions (Vertex AI models, Grounding, Pipelines, Evaluation) and agentic services (planning, tools, memory, human-in-the-loop (HIL)).
Design multi-tenant and hub-and-spoke patterns with Okta/IAP/Apigee for secure API exposure and tenant isolation.
Drive end-to-end delivery across teams: data ingestion (Dataflow/Composer), indexing (BigQuery vectors/Vertex Vector Search), services (Cloud Run/Workflows), events (Pub/Sub).

Platformization & Reuse

Build and maintain prompt libraries, tool catalogs, agent templates, and evaluation harnesses for organization-wide reuse.
Standardize LLMOps: CI/CD for prompts/models/agents, model registry, traceability, rollback, canaries, cost/performance scorecards.
Enable a marketplace of agents/services with productized APIs, documentation, chargeback, and KPIs.

Responsible AI, Security & Compliance

Implement multi-layer guardrails: policy prompts, filters, memory governance, tool whitelisting, audit logs; ensure regulator-ready posture.
Codify privacy, PII handling, data residency, and per-tenant isolation using VPC-SC, Secret Manager, IAM, and Apigee policies.

Leadership & Enablement

Mentor senior engineers and team leads; run architecture reviews, design clinics, and red-team exercises.
Drive continuous evaluation programs and publish org scorecards for quality, safety, and cost.
Partner with Product, Security, and SRE to align roadmaps, SLOs, and operational playbooks.

Required Technical Competencies

LLM & GenAI: Model selection (Gemini & Model Garden), prompt engineering, RAG/grounding, multimodal pipelines, fine-tuning/adapter methods.
Agentic AI (ADK): Agent loops, planners, tool/function design, memory (episodic/semantic/long-term), HIL, policy enforcement.
Data & Retrieval: BigQuery (including vector functions), Vertex Vector Search, Document AI, Dataplex for lineage and governance.
Orchestration & Services: Cloud Run, Workflows, Pub/Sub, Dataflow/Composer; HA/DR, backpressure, circuit breakers.
LLMOps/MLOps: Vertex AI Pipelines, registry, CI/CD, trace correlation, cost/performance monitoring.
Security & Compliance: IAM, Secret Manager, VPC-SC, Private Service Connect, DLP, Okta/IAP, Apigee API policies.
Observability & Cost: Central telemetry, user feedback loops, drift/outlier detection, quota/capacity planning.

Qualifications

12–15+ years in software/data/ML engineering; 2+ years hands-on with LLMs/GenAI and agentic systems.
Proven delivery of enterprise-scale GenAI/agent platforms on GCP (Vertex AI, BigQuery, Cloud Run, Pub/Sub, Workflows).
Demonstrated impact in platformization, governance, and multi-team technical leadership.
Strong proficiency in Python/TypeScript (or equivalent) and infrastructure-as-code (Terraform/GCP Deployment Manager).
Experience in security-by-design, privacy, and compliance audits.

Outcomes & KPIs (What “Great” Looks Like)

Reliability: SLOs met (e.g., p95 latency, error budget adherence); audited HA/DR playbooks; zero Sev1 incidents due to preventable guardrail gaps.
Quality & Safety: Sustained improvements on faithfulness/toxicity/grounding scores; red-team findings resolved within agreed SLAs.
Cost & Performance: ≥ 30% reduction in run-cost via routing, caching, and prompt/template optimization; budget adherence per tenant.
Productivity & Reuse: ≥ 50% reuse of tools/templates across teams; time-to-market reduced by ~40% for new AI features.
Adoption & Enablement: ≥ 3 cross-domain AI capabilities launched per quarter; engineers enabled through patterns and training.

Core Responsibilities (Day-to-Day)

Own reference architectures and standards for GenAI and Agentic AI on GCP.
Lead design reviews and production readiness assessments.
Curate and evolve prompt/agent/tool libraries with versioning and documentation.
Establish evaluation harnesses (golden sets, scenario tests, trace replay, chaos for agents).
Partner with SRE/Platform to implement observability, alerts, feature flags, canaries, and rollback mechanisms.
Drive security reviews, policy-as-code, and auditability for all AI systems.

Demonstrated Behaviors (Principal Level)

Technical Leadership

Systems thinking: Anticipates failure modes, cost implications, and long-term maintenance; makes reversible vs. irreversible decision trade-offs explicit.
Pragmatic innovation: Balances cutting-edge methods (e.g., learned planners, multimodal grounding) with operational simplicity and reliability.
Platform-first mindset: Designs for reuse; evangelizes patterns; prevents bespoke one-offs unless clearly justified.

Execution Excellence

Outcome orientation: Frames problems with clear KPIs; selects the simplest architecture that satisfies reliability, safety, and cost.
Bias to automation: Converts manual steps into workflows, CI/CD pipelines, and platform capabilities; eliminates toil.
Operational rigor: Treats prompts/models/agents as versioned production artifacts with runbooks and guardrails.

Collaboration & Influence

Cross-functional partnering: Brings Product, Security, SRE, and Data together to align goals and reduce friction.
Mentorship & enablement: Coaches senior engineers; raises the bar through reviews, tech talks, and documentation.
Transparent communication: Publishes architecture decisions (ADRs), scorecards, and incident postmortems; drives org learning.

Responsible AI

Safety-first: Insists on multi-layer guardrails and auditability; stops launches when safety signals are insufficient.
Ethical stewardship: Advocates for privacy, fairness, and inclusion; ensures policies are codified and enforced.

Preferred Experience (Nice-to-Have)

Implemented multi-agent collaboration with negotiation protocols and conflict resolution.
Built tenant-aware memory governance and portability models.
Experience with Apigee productization and chargeback for AI services.
Hands-on with Document AI, Dataplex, and multi-region architectures.
