Website:
nava.com
Job details:
Role & Responsibilities
- Design, deploy, and maintain GPU-accelerated infrastructure on Kubernetes (EKS/GKE/AKS) and bare-metal clusters with NVIDIA GPU operators.
- Automate deployment, scaling, and failover of AI workloads using Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
- Implement observability with Prometheus, Grafana, and distributed tracing to monitor GPU utilization, memory, and job latency.
- Troubleshoot GPU driver, CUDA runtime, and container orchestration issues across multi-cluster, multi-region environments.
- Collaborate with ML engineers to optimize job scheduling, resource isolation, and node affinity for high-throughput GPU training/inference.
- Define and enforce SLOs/SLIs for AI infrastructure, automate on-call playbooks, and drive incident post-mortems to eliminate recurring failures.
Skills & Qualifications
- Must-Have
  - Kubernetes
  - Prometheus
  - Grafana
  - Terraform
  - Ansible
  - NVIDIA GPU Operator
  - CUDA
  - GitLab CI
- Preferred
  - ArgoCD
  - Slack/Opsgenie alerting
  - GPU profiling tools (Nsight, DCGM)
Benefits & Culture Highlights
- Work directly on bleeding-edge AI infrastructure powering global LLM and HPC workloads.
- On-site collaboration with deep-tech AI/ML engineers in a high-velocity, outcome-driven culture.
- Full ownership of architecting and scaling infrastructure: no red tape, just impact.