PepsiCo UK
Website: pepsico.com
Job details:
Overview
The AI Observability Engineer (Agentic Frameworks & AI Agent Operations Center Developer) builds and operationalizes agentic AI solutions using modern orchestration frameworks and contributes to an AI Agent Operations Center that enables safe, reliable, and observable agent behavior at scale. This role focuses on developing agent workflows (planning, tool execution, memory, and RAG), integrating guardrails and evaluations, and delivering operational capabilities such as run management, telemetry, and incident triage for production agents.
Responsibilities
- AI Agent Operations Center (70%)
  - Build “operations center” capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
  - Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
  - Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
  - Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
  - Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
  - Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
- Collaboration with Teams (10%)
  - Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains
  - Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability
- Integration & Deployment (10%)
  - Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
  - Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
  - Drive best practices for secure, scalable, and cost-effective agent deployments
- Continuous Learning (10%)
  - Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents
  - Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions
Qualifications
Key Skills/Experience Required (Minimum Qualifications):
- Education: Bachelor’s in Computer Science, AI/ML, Data Science, or a related field.
- Experience: 3–5+ years of software engineering experience; 1+ years building and observing AI/ML or GenAI applications preferred
- Required Expertise:
- Hands-on experience with agentic frameworks (CrewAI, LangChain, Semantic Kernel, AutoGen, or similar)
- Proficiency in Python (primary) and familiarity with APIs/microservices patterns
- Strong experience with RAG patterns (embeddings, vector search, retrieval evaluation, chunking strategies)
- Experience with cloud environments (Azure/AWS/GCP) and containerized deployments (Kubernetes/AKS/EKS)
- Familiarity with observability fundamentals (logs/metrics/traces) and production troubleshooting
- Experience building internal developer platforms or operational consoles (agent registry, run tracking, dashboards)
- Familiarity with OpenTelemetry, distributed tracing, and telemetry pipelines
- Experience with Azure AI Search / vector databases, prompt/version management, and evaluation frameworks
- Knowledge of Responsible AI practices: data handling, safety guardrails, audit trails, and redaction strategies
- FinOps exposure: token/GPU cost optimization and chargeback/showback reporting
- Technical Proficiency: Agent orchestration design (planning, tool execution, memory, RAG); strong engineering discipline (testing, versioning, CI/CD, automation); operational mindset (reliability, debuggability, and incident response support)
- Problem-Solving: Ability to translate business challenges into technical solutions.
- Collaboration Skills: Effective at working within cross-functional teams.
- Agility: Flexibility to adapt to changing requirements and new technologies.
- Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.