Autonomous Systems Engineer (Platform)

Talentiser

Full-time

Required skills

Python
Java
Root Cause Analysis
SRE

Website: talentiser.com

About the Role

We're looking for an Autonomous Systems Engineer to own the reliability, operability, and evolution of our internal engineering platform. This is a hands-on role at the intersection of platform engineering, reliability, and intelligent automation, with a clear mandate: reduce toil, improve observability, and enable automated systems (including AI agents) to operate safely at scale.

You'll work directly with engineering teams to harden services, respond to incidents, and build automation that makes the platform increasingly self-managing over time. A key aspect of this role is designing and operating AI-driven and agent-based workflows, including the guardrails, validation systems, and observability needed to allow automated systems to safely generate and act on changes in production environments.

What You'll Do

Own reliability, availability, and performance of the internal platform and critical services

Participate in on-call rotations; lead incident triage, debugging, root cause analysis, and post-mortems

Build and operate platform automation and AI-powered workflows (including agent-based systems) to reduce manual operational effort

Design and implement guardrails, validation pipelines, and safety mechanisms for automated and AI-generated changes to code and infrastructure

Enable closed-loop automation systems (detect → diagnose → remediate → validate) to improve system resilience

Define and track SLIs and SLOs; use reliability data to guide engineering decisions

Standardize build, deployment, and release workflows for safe, predictable delivery, including automation-friendly and AI-integrated pipelines

Identify and remediate security vulnerabilities across systems and services, including risks introduced by automated changes

Partner with development teams on service design, resilience, and operability, with an emphasis on automation-first and AI-compatible system design

Required Qualifications

5+ years of experience operating production platforms or large-scale distributed systems

Proven track record in incident management, on-call operations, and production debugging

Strong programming skills in Java, Python, Go, Shell, or equivalent

Hands-on experience with observability tooling (monitoring, alerting, logging, tracing)

Experience building or maintaining CI/CD pipelines and release processes

Familiarity with platform upgrades, dependency management, and system lifecycle operations

Experience building or integrating AI-driven (agent-based) automation frameworks, or strong interest in this space

Working knowledge of Linux-based production environments

Strong communication and cross-team collaboration skills

Nice to Have

Experience with SRE frameworks: SLOs, error budgets, reliability reviews

Experience with chaos engineering or resilience testing

Background in building self-healing systems

History of driving platform standardization across large engineering organizations

Key Traits

Strong ownership mentality. Calm under pressure. Bias toward automation. Systems thinker who doesn't just fix problems but builds systems that prevent, detect, and autonomously remediate issues over time.
