Website:
epiccompany.com
Job details:
Role summary
You will own the orchestration_core that composes the six orchestrators (intake, story, direction, design, production, approval) into a single LangGraph state machine with five user decision gates. You set the patterns for how agents talk to MCP servers, how state is checkpointed, how interrupts are handled at gates, and how cost and latency budgets are enforced per node.
Minimum Work Experience
• 3 to 7 years of hands-on experience with the skills listed below
Must-have skills
• Production LangGraph (Python). StateGraph, Send API, interrupt_before/interrupt_after, conditional edges, subgraphs. You have shipped at least one agentic system to production users.
• LangGraph checkpointing. PostgresSaver, thread state, run replay, resume-from-checkpoint patterns. You understand why checkpointing is the difference between a demo and a product.
• Anthropic SDK. Tool use loops, streaming, prompt caching, vision, structured outputs. You have written your own model router with progressive degradation (e.g., Opus → Sonnet → Haiku).
• MCP (Model Context Protocol). You have built or significantly contributed to MCP servers using the official Python SDK. You understand the security boundary MCP creates and why you never let an agent talk to a database directly.
• Async Python at depth. asyncio, anyio, httpx async, asyncpg, redis.asyncio. You can explain exactly how a synchronous call poisons an event loop and have debugged it in production.
• FastAPI + Pydantic v2. Depends() patterns, response_model, BackgroundTasks, middleware, WebSockets. Pydantic strict mode, field_validator, model_validator.
• LLM observability. Langfuse traces, generation spans, cost attribution, prompt registries. You can answer 'why did this run cost X?' in under 30 seconds.
• System design for multi-tenant agents. RLS, tenant context propagation across agents, cost ceilings per tenant, rate limiting, idempotency.
Day-to-day responsibilities
• Design and own the master graph in orchestration_core, including all five user decision gates and resume-from-checkpoint flows.
• Set conventions for orchestrator-to-orchestrator handoff, signal emission (Redis Streams), and cost tracking per node.
• Code-review every orchestrator change; mentor the mid-senior AI engineer.
• Own the model routing policy (Opus/Sonnet/Haiku/Gemini) and the progressive degradation chain.
• Coordinate with Track B on serving contracts: define how fine-tuned text and video models are exposed to orchestrators.
• Partner with QA on golden-path, failure-path, and recovery-path test scenarios.
• Drive architectural decisions on prompt caching, semantic cache, and idempotency keys.
• Represent EPIC Co-Director in technical reviews with vendors (Anthropic, Google, video model providers).