Platform Site Reliability Engineer (SRE)

CirrusLabs

full-time

Required skills

Python
Agile
Ansible
automated tests
Bash
DevOps
firmware
GPU
incident response
Kanban
Kubernetes
Linux
platform services
production support
Root Cause Analysis
SRE
uptime

About the role

Website: cirruslabs.io
Job details:

We are hiring a talented Platform Site Reliability Engineer (SRE) to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.

Experience: 3-6 years

Shift Time: 2 PM - 11 PM IST

We are seeking a Platform Site Reliability Engineer (SRE) to support the reliability, observability, and day-2 operations of modern AI platform environments running performance-sensitive workloads. This role is suited for someone with hands-on experience in production support, monitoring, alerting, incident response, Linux troubleshooting, and operational automation across platform and infrastructure layers.

The ideal candidate has experience with Prometheus, Grafana, and logging/metrics platforms, and can work across compute, platform, DevOps, storage, and network teams to improve service health, reduce alert noise, speed up incident resolution, and strengthen overall platform reliability.

Key Responsibilities

Support reliability and day-2 operations for production platform environments.
Build and maintain monitoring, alerting, dashboards, and operational reporting across infrastructure and platform services.
Use tools such as Prometheus, Grafana, and related observability platforms to track health, availability, capacity, and performance.
Troubleshoot issues across Linux hosts, containers, platform services, and infrastructure dependencies.
Support incident detection, triage, root cause analysis, and post-incident improvements.
Tune alerts and service checks to improve signal quality and reduce false positives.
Partner with platform, compute, storage, DevOps, and network teams to isolate and resolve production issues.
Automate repetitive operational tasks using Bash, Python, Ansible, or similar tools.
Maintain runbooks, monitoring standards, alert documentation, and operational procedures.
Contribute to continuous improvement through standardization, automation, and reliability best practices.
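
As an illustration of the alert-tuning responsibility above, here is a minimal Python sketch of one common way to cut false positives: only fire an alert after a check has failed several evaluations in a row. It is similar in spirit to the `for` clause in Prometheus alerting rules. The function name `debounced_alerts` and the default threshold of 3 are hypothetical choices for this example, not part of any CirrusLabs tooling.

```python
from typing import Iterable, List


def debounced_alerts(samples: Iterable[bool], min_consecutive: int = 3) -> List[int]:
    """Return the sample indices at which an alert would fire.

    Each sample is True when the check failed at that evaluation.
    The alert fires once, when the failure streak first reaches
    min_consecutive, and cannot fire again until the check recovers.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")

    fired: List[int] = []
    streak = 0
    for index, failed in enumerate(samples):
        if failed:
            streak += 1
            if streak == min_consecutive:
                fired.append(index)
        else:
            # Any successful evaluation resets the streak.
            streak = 0
    return fired
```

With a threshold of 1, every isolated blip pages someone; raising the threshold trades a little detection delay for fewer noisy pages.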

Must Have Skills (3–6 years)

Strong Linux administration and troubleshooting skills
Experience supporting production environments with a focus on uptime and operational stability
Experience writing automated tests or synthetic checks for infrastructure/platform validation
Experience with Kubernetes, containers, and distributed platform environments
Hands-on experience with monitoring and alerting in production systems
Experience with Prometheus, Grafana, or similar observability tools
Ability to troubleshoot issues across host, service, infrastructure, and platform layers
Experience with incident triage, support operations, and runbook-driven response
Basic scripting or automation experience using Bash, Python, or Ansible
Strong collaboration skills across platform, infrastructure, DevOps, and support teams
Experience creating or maintaining dashboards, alerts, SOPs, and operational documentation
Strong adherence to Agile or Kanban ways of working: delivering work within defined cadences or flow-based priorities, and providing consistent, proactive status updates on progress, risks, and blockers
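
For candidates wondering what a "synthetic check" might look like in practice, the following is a minimal sketch in Python, assuming the simplest possible probe: can a TCP connection be opened to a service within a timeout? The function name `tcp_check` and the 2-second default timeout are hypothetical; real checks in this role would likely go further (HTTP status, response content, Kubernetes API health, and so on).

```python
import socket
import time
from typing import Tuple


def tcp_check(host: str, port: int, timeout: float = 2.0) -> Tuple[bool, float]:
    """Try to open a TCP connection and report (success, seconds elapsed)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects, raising
        # OSError (including timeouts and refusals) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start
```

A scheduler or monitoring agent could call `tcp_check` periodically and export both the boolean result and the latency as metrics.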

Nice to Have Skills

Experience with ELK, Loki, OpenSearch, or similar logging tools
Experience with NVIDIA GPU infrastructure (DCGM, GPU Operator, NVAIE)
Exposure to hardware-level telemetry (BMC/IPMI, firmware health, thermal/power monitoring, and other low-level data points)
Exposure to telemetry, exporters, instrumentation, and service health checks
Experience with capacity monitoring, trend analysis, and performance reporting
Familiarity with RCA, postmortems, SLI/SLO concepts, and reliability improvement practices
Exposure to CI/CD pipelines and Git-based operational workflows
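
To make the SLI/SLO item above concrete, here is a small Python sketch of the usual error-budget arithmetic, assuming an event-based SLI (good events divided by total events). The function name `error_budget_remaining` and the sample figures are hypothetical and only illustrate the concept.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is a fraction such as 0.999 (99.9%). The budget is the
    number of bad events the SLO tolerates: (1 - slo_target) * total_events.
    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad
```

For example, with a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so 400 failures leave roughly 60% of the budget.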

Click Apply to learn more.