HPC Systems Administrator

Dayananda Sagar University

Location: Bengaluru, Karnataka, India
Job type: Full-time

Required skills

Python
accounting
Ansible
Bash
compliance
CUDA
Docker
firmware
GPU
incident response
LDAP
Linux
MySQL
OpenMPI
Postgres
RHEL
SQL
TensorFlow
Ubuntu
PyTorch

About the role

Website: dsu.edu.in
Job details:

Job Description: HPC Systems Administrator

Department: AI & HPC Infrastructure

Location: DSU – Main Campus

Role Overview

The HPC Systems Administrator will own the NVIDIA Quantum‑2 InfiniBand fabric, Slurm job scheduling, and GPU driver/runtime orchestration across our AI/HPC cluster. The immediate priority is to configure a non‑blocking, all‑to‑all communication topology for 160 GPUs to support large‑scale distributed training with minimal latency and maximal throughput.

Key Responsibilities

1) InfiniBand Fabric (NVIDIA Quantum‑2)

Design, deploy, and manage the NVIDIA Quantum‑2 IB fabric (NDR/HDR) including spine/leaf switches, port profiles, link policies, and topology validation.
Maintain subnet managers (UFM/NVSM/opensm), routing engines, partitions (PKeys), SL/QoS policies, and congestion control (ECN/DCQCN).
Monitor and remediate fabric health: link flaps, credit starvation, head‑of‑line blocking, FEC errors, and symbol/packet error counters.
Ensure lossless, non‑blocking paths for all‑reduce/collective ops (NCCL) and MPI‑based workloads.

2) Slurm Job Scheduling & Cluster Ops

Administer Slurm (slurmctld/slurmd, slurmdbd with a SQL backend) with fair‑share, QOS, partitions for CPU/GPU nodes, GRES GPU accounting, and preemption policies.
Implement job placement rules to co‑locate multi‑node GPU jobs on optimal fabric domains; tune topology plugins and SelectType (cons_tres).
Automate image/provisioning (xCAT/MAAS/Kickstart) and config management (Ansible) across nodes.
Maintain maintenance windows, rolling upgrades, and change control.

3) GPU Driver & Runtime Orchestration

Manage NVIDIA GPU drivers, CUDA, cuDNN, NCCL (including NCCL RDMA/SHARP/NVLink settings), MIG profiles, and DCGM monitoring.
Validate compatibility matrix across OS kernels, drivers, container runtimes (Docker, Singularity/Apptainer), and framework versions.
Build golden images/containers for training stacks (PyTorch/TensorFlow/JAX), compile and tune comms libraries (UCX, SHARP, NVSHMEM).

4) Performance, Reliability & Observability

Benchmark and optimize end‑to‑end throughput (NCCL tests, ib_send_lat/ib_write_bw, OSU microbenchmarks).
Implement monitoring with DCGM, Prometheus/Grafana, UFM telemetry, Slurm accounting (sacct), and alerting (Alertmanager).
Lead incident response, RCA, and preventive actions for fabric, scheduler, GPU, and storage interactions.

5) Security, Compliance & Documentation

Enforce RBAC, user isolation, secure multi‑tenancy, and network segmentation (PKeys, VLANs where applicable).
Patch OS/firmware/drivers with staged rollouts and rollback plans.
Maintain architecture diagrams, runbooks, SOPs, and capacity plans.

Priority Deliverable

Within 60–90 days: Achieve non‑blocking, all‑to‑all communication across 160 GPUs for large‑scale model training, measured against target benchmarks to be agreed (e.g., ≥ X TB/s aggregate all‑reduce throughput; ≤ Y µs p99 latency on 8/16/32/64‑node jobs).
Validate with NCCL tests, SHARP collectives, and representative training runs; publish a tuning guide for researchers.

Required Skills & Experience

Technical Must‑Haves

4–10 years administering production HPC/AI clusters.
Deep hands‑on experience with InfiniBand (HDR/NDR): fabric design, UFM/opensm, QoS/SL, PKeys, congestion control.
Strong Slurm expertise: partitions, QOS/fairshare, GRES GPU, cgroup integration, topology‑aware placement, accounting (MySQL/Postgres).
NVIDIA stack: CUDA, cuDNN, NCCL, DCGM, NVML, MIG; driver lifecycle and compatibility.
Linux (RHEL/Ubuntu), kernel tuning (IRQ affinity, hugepages, NUMA, BIOS power profiles), and systemd‑level operations.
Automation: Ansible, Bash/Python scripting; image/provisioning tools (xCAT/MAAS/Kickstart).
Containers for HPC/AI: Docker, Singularity/Apptainer; registry management and CVE hygiene.

Good‑to‑Have

Experience with NVIDIA SHARP, NVSHMEM, UCX/UCX‑Py, and MPI stacks (OpenMPI/MPICH).
Storage familiarity for AI/HPC: NVMe tiers, parallel FS (Lustre/BeeGFS/GPFS), and I/O path tuning.
Monitoring/observability: Prometheus/Grafana, ELK/OpenSearch, UFM telemetry.
Security & multi‑tenancy (LDAP/AD, FreeIPA, Keycloak, IAM integrations).

Soft Skills

Strong incident management and RCA discipline.
Clear written documentation; ability to mentor researchers and junior admins.
Stakeholder communication across infrastructure, security, and research teams.

Qualifications

Bachelor’s/Master’s in Computer Science, Electrical/Computer Engineering, or related field.
Relevant certifications are a plus: NVIDIA (DGX/Networking), RHCE/LFCS, CKA/CKS.

KPIs & Success Metrics

Throughput/latency targets for all‑reduce and multi‑node training.
Fabric health: zero critical alerts, low CRC/FEC error rates, consistent link utilization, minimal congestion events.
Scheduler efficiency: high cluster utilization (>80%) with fair‑share compliance and reduced wait times for priority queues.
Change reliability: zero unplanned outages during driver/firmware upgrades.
MTTR/MTBF improvements and SLA adherence.

Date: 20-03-2026

Dr. D. Premachandra Sagar

Pro Chancellor, DSU

