Senior Software Engineer, Compute ML Scheduling and Observability

Min Experience

5 years

Location

San Francisco, CA; New York City, NY; Seattle, WA

Job Type

full-time

Overview

About the role

As a Senior Software Engineer on the Compute ML Scheduling and Observability team, you will be responsible for building and improving the systems that power Anthropic's machine learning training and inference infrastructure. This includes designing and implementing the scheduling and monitoring systems that ensure efficient and reliable utilization of our compute resources. You will work closely with our machine learning engineers, infrastructure teams, and product managers to continuously improve the capabilities of our compute platform.

Key Responsibilities:

- Design and implement robust and scalable scheduling, monitoring, and observability systems for Anthropic's ML training and inference workloads
- Build systems that enable efficient and reliable compute utilization, including techniques like preemption, resource sharing, and multi-tenancy
- Develop metrics, dashboards, and alerting systems to provide full visibility into the health and performance of our compute infrastructure
- Collaborate with ML engineers, infrastructure teams, and product managers to identify and implement new capabilities and optimizations
- Contribute to the overall technology strategy and roadmap for Anthropic's compute platform

Qualifications:

- 5+ years of experience as a software engineer, with a proven track record of delivering complex, high-impact systems
- Strong background in distributed systems, concurrent programming, and performance engineering
- Proficient in at least one modern programming language (e.g., Python, Go, Rust)
- Experience with cloud infrastructure, containers, and orchestration tools (e.g., Kubernetes)
- Familiarity with machine learning workflows and the unique requirements of ML infrastructure
- Excellent problem-solving and communication skills, with the ability to work collaboratively across teams
- Passion for building reliable, scalable, and high-performance systems

About the company

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Skills

python

rust

kubernetes