Nucleus AI
Website: withnucleus.ai
Job details:
Reliable systems make ambitious products possible. As a Software Engineer, Reliability at Nucleus, you will improve the uptime, resilience, and operational maturity of our critical product and platform services. You’ll work across architecture, observability, incident response, and engineering workflows to help our systems recover gracefully, fail more safely, and scale with confidence. This is a hands-on role for engineers who care about both systems design and operational excellence.
In this role, you will
- Improve reliability across critical backend, platform, and infrastructure services.
- Build tooling and automation for incident response, remediation, service health, and operational workflows.
- Strengthen observability, alerting, capacity planning, and failure analysis across distributed systems.
- Help improve on-call quality, runbooks, and post-incident learning practices.
- Partner with engineering teams to identify reliability risks early and design more resilient systems.
You may be a good fit if you have
- Strong experience in distributed systems, SRE, backend infrastructure, or production engineering.
- Familiarity with observability, incident management, service ownership, and reliability engineering practices.
- A track record of improving operational health through both systems design and cultural change.
- Comfort debugging complex production issues under real constraints.
What makes Nucleus different
At Nucleus, reliability work is deeply tied to mission-critical AI systems. Your work will directly shape the quality and trustworthiness of products people depend on every day.
If you believe reliability is both a product feature and a discipline, we’d love to talk.
Click Apply to learn more.