Talentiser
Website: talentiser.com
Job Description
About the Role
We're looking for an Autonomous Systems Engineer to own the reliability, operability, and evolution of our internal engineering platform. This is a hands-on role at the intersection of platform engineering, reliability, and intelligent automation, with a clear mandate: reduce toil, improve observability, and enable automated systems (AI agents) to operate safely at scale.
You'll work directly with engineering teams to harden services, respond to incidents, and build automation that makes the platform increasingly self-managing over time. A key aspect of this role is designing and operating AI-driven and agent-based workflows, including the guardrails, validation systems, and observability needed to allow automated systems to safely generate and act on changes in production environments.
What You'll Do
Own reliability, availability, and performance of the internal platform and critical services
Participate in on-call rotations; lead incident triage, debugging, root cause analysis, and post-mortems
Build and operate platform automation and AI-powered workflows (including agent-based systems) to reduce manual operational effort
Design and implement guardrails, validation pipelines, and safety mechanisms for automated and AI-generated changes to code and infrastructure
Enable closed-loop automation systems (detect → diagnose → remediate → validate) to improve system resilience
Define and track SLIs and SLOs; use reliability data to guide engineering decisions
Standardize build, deployment, and release workflows for safe, predictable delivery, including automation-friendly and AI-integrated pipelines
Identify and remediate security vulnerabilities across systems and services, including risks introduced by automated changes
Partner with development teams on service design, resilience, and operability, with an emphasis on automation-first and AI-compatible system design
Required Qualifications
5+ years of experience operating production platforms or large-scale distributed systems
Proven track record in incident management, on-call operations, and production debugging
Strong programming skills in Java, Python, Go, Shell, or equivalent
Hands-on experience with observability tooling (monitoring, alerting, logging, tracing)
Experience building or maintaining CI/CD pipelines and release processes
Familiarity with platform upgrades, dependency management, and system lifecycle operations
Experience building or integrating AI-driven (agent-based) automation frameworks, or strong interest in this space
Working knowledge of Linux-based production environments
Strong communication and cross-team collaboration skills
Nice to Have
Experience with SRE frameworks: SLOs, error budgets, reliability reviews
Experience with chaos engineering or resilience testing
Background in building self-healing systems
History of driving platform standardization across large engineering organizations
Key Traits
Strong ownership mentality. Calm under pressure. Bias toward automation. Systems thinker who doesn't just fix problems but builds systems that prevent, detect, and autonomously remediate issues over time.