Minutes to Seconds
Website: minutestoseconds.com
Job details:
Role Overview
The Senior ML Platform Architect is responsible for the strategic design and execution of the foundational infrastructure powering global AI initiatives. This role focuses on building a unified, multi-cloud ecosystem that integrates seamless model orchestration, real-time data connectivity via high-throughput messaging, and context-aware AI services. The work ensures that the organization’s AI capabilities are vendor-agnostic, secure, and built on a resilient, high-performance core.
Core Responsibilities
Multi-Cloud Enablement & ML Strategy
- Vendor-Agnostic Architecture: Design and implementation of a multi-cloud ML strategy (AWS, GCP, Azure) to prevent vendor lock-in and optimize for global availability and cost efficiency.
- Unified Model Orchestration: Architecture of abstract infrastructure layers using Kubernetes to ensure ML training and inference workloads port across different cloud providers without code modification.
- Global Connectivity: Establishment of cross-cloud networking and identity standards to maintain a consistent security posture and data access layer across all environments.
MCP (Model Context Protocol) Server Foundation
- Contextual Architecture: Building and scaling the MCP Server foundation, enabling the decoupling of AI reasoning from tool execution for enhanced modularity.
- Standardized Integration: Design of universal adapter layers that allow Large Language Models (LLMs) to securely access external databases, APIs, and internal file systems through standardized protocols.
- Governance & Discovery: Architecture of centralized discovery services for MCP servers to allow AI agents to dynamically find and invoke capabilities with strict audit trails.
High-Throughput Message Bus & Data Flow
- Event-Driven AI Backbone: Design and implementation of a low-latency, high-throughput message bus (e.g., Kafka, Pulsar) to handle real-time data streaming and asynchronous ML pipeline triggers.
- Scalable Feature Distribution: Architecture of the backbone for streaming features and model events, ensuring high-volume inference logs and telemetry data are ingested with zero data loss.
- System Decoupling: Utilization of the message bus to decouple ML microservices, increasing the horizontal scalability and fault tolerance of the AI platform.
Strengthening the Core Application Layer
- Leadership of the Security, Resilience, and Quality of Release chapter:
- Security: Implementation of Zero-Trust architecture for AI workloads, including model weight encryption, secure secret management, and protection against adversarial attacks.
- Resilience: Design of self-healing systems, multi-region failover strategies, and high-availability ML services to ensure mission-critical uptime.
- Quality of Release: Establishment of automated, architecture-level release gates including performance benchmarking, security scanning, and automated canary/blue-green deployment strategies.
Requirements
Technical Requirements
- Experience: 10+ years of professional experience in Systems Architecture or Software Engineering, with at least 4 years specifically dedicated to ML Platform or AI Infrastructure.
- Cloud Mastery: Expert-level proficiency in architecting for Multi-Cloud environments and managing distributed systems at scale.
- Streaming & Messaging: Proven track record with Kafka, Pulsar, or similar high-throughput event-streaming technologies.
- Protocols & AI Integration: Strong understanding of JSON-RPC, Server-Sent Events (SSE), and modern AI communication protocols like MCP.
- Engineering Standards: Mastery of Kubernetes, Terraform (IaC), and service mesh technologies to maintain platform stability.