About the role
Chroma's users put a lot of trust in us. They rely on us to keep their data safe, secure, and available. Chroma must live up to that trust.
As a Production Engineer at Chroma, you will play a critical role in ensuring that Chroma's cloud service stays highly available and delivers exceptional performance for AI developers worldwide.
This role blends software engineering with systems engineering: you'll write code to improve reliability and scalability while keeping the infrastructure running without downtime.
Your contributions will help Chroma build a highly resilient platform to support AI applications at scale, while developing the tools and processes that enable smooth operation and continuous improvement.
You will:
Build the infrastructure that keeps Chroma's cloud services highly available, durable, and performing at their best.
Design and implement robust disaster recovery strategies and fault-tolerant systems, incorporating lessons from post-incident reviews to continuously improve infrastructure resilience and adaptability.
Design and implement scalable monitoring, alerting, and self-healing systems across Chroma's infrastructure.
Write high-quality, efficient code to automate processes, enhance system reliability, and minimize operational overhead.
Develop and maintain documentation, operational procedures, and capacity plans to ensure preparedness for future scaling.
Proactively identify potential reliability and performance bottlenecks, addressing issues before they impact users.
About the company
Retrieval is the data infrastructure for AI and a critical component in this new software development stack.
Chroma is proud to be the leading retrieval system, trusted and loved by developers around the world.
We're still in the pre-history of AI. We're looking for curious people who are dedicated to becoming world-class at their craft to join our team.
There is a lot of important work to do. Join us.