Site Reliability Engineer (SRE) – GPU Infrastructure

Nava

Full-time

Website: nava.com
Job details:
Role & Responsibilities

Design, deploy, and maintain GPU-accelerated infrastructure on Kubernetes (EKS/GKE/AKS) and bare-metal clusters using the NVIDIA GPU Operator.
Automate deployment, scaling, and failover of AI workloads using Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
Implement observability with Prometheus, Grafana, and distributed tracing to monitor GPU utilization, memory, and job latency.
Troubleshoot GPU driver, CUDA runtime, and container orchestration issues across multi-cluster, multi-region environments.
Collaborate with ML engineers to optimize job scheduling, resource isolation, and node affinity for high-throughput GPU training/inference.
Define and enforce SLOs/SLIs for AI infrastructure, automate on-call playbooks, and drive incident post-mortems to eliminate recurring failures.

Skills & Qualifications

Benefits & Culture Highlights

Work directly on bleeding-edge AI infrastructure powering global LLM and HPC workloads.
On-site collaboration with deep-tech AI/ML engineers in a high-velocity, outcome-driven culture.
Full ownership of infrastructure architecture and scaling: no red tape, just impact.

Skills: nvidia, ml, platforms, automation, building, infrastructure, cloud, teams, reliability, gpu, code