Millennium
Website: mlp.com
Job details:
We’re a high-impact platform team building the firm’s internal AI platform that bridges traditional enterprise platforms (identity, data, workflow, governance) with GenAI tools (agents, copilots, model providers).
This is a senior engineering role focused on designing and owning the core service layer that agentic tools run on: an AI gateway, model/provider routing, policy/guardrails, tool-execution interfaces, high-throughput async APIs, and production-grade observability. MCP (Model Context Protocol) services are part of the platform portfolio, enabling secure, governed connectivity between agent runtimes and enterprise tools/data. You’ll partner closely with AI Engineers building agent workflows; your focus is to make the underlying platform fast, reliable, secure, and easy to build on.
Key Responsibilities
- Design, build, and operate core platform services (Python; REST + async; streaming where appropriate) powering firm-wide internal AI/agentic capabilities.
- Own gateway/platform concerns end-to-end: routing, timeouts/retries, streaming, request shaping, rate limits/quotas, multi-tenancy, policy enforcement, provider abstraction, safe degradation, and robust client experience.
- Build and operate MCP capabilities as part of the platform.
- Build for scale and availability on Kubernetes: autoscaling, rollout strategies, capacity planning, performance tuning, and production debugging.
- Raise reliability practices: define and manage SLOs/SLIs, instrumentation standards, incident response/runbooks, post-incident follow-ups, load/resilience testing, and operational excellence.
- Improve delivery safety: CI/CD, environment promotion, IaC-driven repeatability, and secure SDLC practices.
- Influence roadmap and technical strategy: prioritize foundational investments and reduce platform risk for a business-critical internal platform.
Required Qualifications
- 7+ years of professional software engineering experience (or equivalent practical experience).
- Strong expertise in Python, Java, or Go, including async patterns, concurrency, and building high-throughput services (FastAPI or similar).
- Solid distributed systems fundamentals: idempotency, backpressure, failure isolation, consistency tradeoffs, rate limiting, retries/timeouts.
- Production experience operating services on Kubernetes (deployments, autoscaling, debugging, observability, performance).
- Basic familiarity with LLM integration patterns (streaming responses, tool/function calling).
- Demonstrated design leadership (RFCs, architecture reviews, leading cross-team initiatives).
- Excellent communication skills, with the ability to explain technical tradeoffs to stakeholders and partner teams.
Preferred Qualifications
- Experience with service-to-service authentication patterns (API keys, OAuth/JWT, mTLS concepts).
- Familiarity with observability tooling (structured logs, metrics, tracing; Datadog or OpenTelemetry a plus).
- Strong fundamentals in AWS (or GCP/Azure) relevant to secure platforms (IAM, networking basics, compute, logging/monitoring patterns).
- Working proficiency with Terraform and automation-first operations (repeatable environments, policy checks, safe rollouts).
- Comfort using AI dev tools (Claude Code, Cursor, Gemini CLI) responsibly (tests, validation, secure coding).