CirrusLabs
Website: cirruslabs.io
Job details:
We are hiring a talented Platform Site Reliability Engineer (SRE) to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.
Experience: 3-6 years
Shift Time: 2 PM - 11 PM IST
We are seeking a Platform Site Reliability Engineer (SRE) to support the reliability, observability, and day-2 operations of modern AI platform environments running performance-sensitive workloads. This role is suited for someone with hands-on experience in production support, monitoring, alerting, incident response, Linux troubleshooting, and operational automation across platform and infrastructure layers.
The ideal candidate has experience with Prometheus, Grafana, and logging/metrics platforms, and can work across compute, platform, DevOps, storage, and network teams to improve service health, reduce alert noise, speed up incident resolution, and strengthen overall platform reliability.
Key Responsibilities
- Support reliability and day-2 operations for production platform environments.
- Build and maintain monitoring, alerting, dashboards, and operational reporting across infrastructure and platform services.
- Use tools such as Prometheus, Grafana, and related observability platforms to track health, availability, capacity, and performance.
- Troubleshoot issues across Linux hosts, containers, platform services, and infrastructure dependencies.
- Support incident detection, triage, root cause analysis, and post-incident improvements.
- Tune alerts and service checks to improve signal quality and reduce false positives.
- Partner with platform, compute, storage, DevOps, and network teams to isolate and resolve production issues.
- Automate repetitive operational tasks using Bash, Python, Ansible, or similar tools.
- Maintain runbooks, monitoring standards, alert documentation, and operational procedures.
- Contribute to continuous improvement through standardization, automation, and reliability best practices.
Must Have Skills (3–6 years)
- Strong Linux administration and troubleshooting skills
- Experience supporting production environments with a focus on uptime and operational stability
- Experience writing automated tests or synthetic checks for infrastructure/platform validation
- Experience with Kubernetes, containers, and distributed platform environments
- Hands-on experience with monitoring and alerting in production systems
- Experience with Prometheus, Grafana, or similar observability tools
- Ability to troubleshoot issues across host, service, infrastructure, and platform layers
- Experience with incident triage, support operations, and runbook-driven response
- Basic scripting or automation experience using Bash, Python, or Ansible
- Strong collaboration skills across platform, infrastructure, DevOps, and support teams
- Experience creating or maintaining dashboards, alerts, SOPs, and operational documentation
- Strong adherence to Agile or Kanban ways of working: delivering work within defined cadences or flow-based priorities, and providing consistent, proactive status updates on progress, risks, and blockers
Nice to Have Skills
- Experience with ELK, Loki, OpenSearch, or similar logging tools
- Experience with NVIDIA GPU infrastructure (DCGM, GPU Operator, NVAIE)
- Exposure to low-level hardware telemetry (BMC/IPMI, firmware health, thermal and power monitoring)
- Exposure to telemetry, exporters, instrumentation, and service health checks
- Experience with capacity monitoring, trend analysis, and performance reporting
- Familiarity with RCA, postmortems, SLI/SLO concepts, and reliability improvement practices
- Exposure to CI/CD pipelines and Git-based operational workflows