Website: dsu.edu.in
Job details:
Job Description: HPC Systems Administrator
Department: AI & HPC Infrastructure
Location: DSU – Main Campus
Role Overview
The HPC Systems Administrator will own the NVIDIA Quantum‑2 InfiniBand fabric, Slurm job scheduling, and GPU driver/runtime orchestration across our AI/HPC cluster. The immediate priority is to configure a non‑blocking, all‑to‑all communication topology for 160 GPUs to support large‑scale distributed training with minimal latency and maximal throughput.
Key Responsibilities
1) InfiniBand Fabric (NVIDIA Quantum‑2)
- Design, deploy, and manage the NVIDIA Quantum‑2 IB fabric (NDR/HDR) including spine/leaf switches, port profiles, link policies, and topology validation.
- Maintain subnet managers (UFM/OpenSM), routing engines, partitions (PKeys), SL/QoS policies, and InfiniBand congestion control (FECN/BECN).
- Monitor and remediate fabric health: link flaps, credit starvation, head‑of‑line blocking, FEC errors, and symbol/packet error counters.
- Ensure lossless, non‑blocking paths for all‑reduce/collective ops (NCCL) and MPI‑based workloads.
2) Slurm Job Scheduling & Cluster Ops
- Administer Slurm (slurmctld/slurmd, slurmdbd with its SQL backend) with fair‑share, QOS, partitions for CPU/GPU nodes, GRES GPU accounting, and preemption policies.
- Implement job placement rules to co‑locate multi‑node GPU jobs on optimal fabric domains; tune topology plugins and SelectType (cons_tres).
- Automate image/provisioning (xCAT/MAAS/Kickstart) and config management (Ansible) across nodes.
- Schedule maintenance windows and manage rolling upgrades and change control.
3) GPU Driver & Runtime Orchestration
- Manage NVIDIA GPU drivers, CUDA, cuDNN, NCCL (including NCCL RDMA/SHARP/NVLink settings), MIG profiles, and DCGM monitoring.
- Validate compatibility matrix across OS kernels, drivers, container runtimes (Docker, Singularity/Apptainer), and framework versions.
- Build golden images/containers for training stacks (PyTorch/TensorFlow/JAX), compile and tune comms libraries (UCX, SHARP, NVSHMEM).
4) Performance, Reliability & Observability
- Benchmark and optimize end‑to‑end throughput (NCCL tests, ib_send_lat/ib_write_bw, OSU microbenchmarks).
- Implement monitoring with DCGM, Prometheus/Grafana, UFM telemetry, Slurm accounting (sacct), and alerting (Alertmanager).
- Lead incident response, RCA, and preventive actions for fabric, scheduler, GPU, and storage interactions.
5) Security, Compliance & Documentation
- Enforce RBAC, user isolation, secure multi‑tenancy, and network segmentation (PKeys, VLANs where applicable).
- Patch OS/firmware/drivers with staged rollouts and rollback plans.
- Maintain architecture diagrams, runbooks, SOPs, and capacity plans.
Priority Deliverable
- Within 60–90 days: achieve non‑blocking, all‑to‑all communication across 160 GPUs for large‑scale model training, measured against target benchmarks to be agreed (e.g., ≥ X TB/s aggregate all‑reduce throughput; ≤ Y µs p99 latency on 8/16/32/64‑node jobs).
- Validate with NCCL tests, SHARP collectives, and representative training runs; publish a tuning guide for researchers.
Required Skills & Experience
Technical Must‑Haves
- 4–10 years administering production HPC/AI clusters.
- Deep hands‑on experience with InfiniBand (HDR/NDR): fabric design, UFM/OpenSM, QoS/SL, PKeys, congestion control.
- Strong Slurm expertise: partitions, QOS/fairshare, GRES GPU, cgroup integration, topology‑aware placement, accounting (slurmdbd with MySQL/MariaDB).
- NVIDIA stack: CUDA, cuDNN, NCCL, DCGM, NVML, MIG; driver lifecycle and compatibility.
- Linux (RHEL/Ubuntu), kernel tuning (IRQ affinity, hugepages, NUMA, BIOS power profiles), and systemd‑level operations.
- Automation: Ansible, Bash/Python scripting; image/provisioning tools (xCAT/MAAS/Kickstart).
- Containers for HPC/AI: Docker, Singularity/Apptainer; registry management and CVE hygiene.
Good‑to‑Have
- Experience with NVIDIA SHARP, NVSHMEM, UCX/UCX‑Py, and MPI stacks (OpenMPI/MPICH).
- Storage familiarity for AI/HPC: NVMe tiers, parallel FS (Lustre/BeeGFS/GPFS), and I/O path tuning.
- Monitoring/observability: Prometheus/Grafana, ELK/OpenSearch, UFM telemetry.
- Security & multi‑tenancy (LDAP/AD, FreeIPA, Keycloak, IAM integrations).
Soft Skills
- Strong incident management and RCA discipline.
- Clear written documentation; ability to mentor researchers and junior admins.
- Stakeholder communication across infrastructure, security, and research teams.
Qualifications
- Bachelor’s/Master’s in Computer Science, Electrical/Computer Engineering, or related field.
- Relevant certifications are a plus: NVIDIA (DGX/Networking), RHCE/LFCS, CKA/CKS.
KPIs & Success Metrics
- Throughput/latency targets for all‑reduce and multi‑node training.
- Fabric health: zero critical alerts, low CRC/FEC error rates, consistent link utilization, minimal congestion events.
- Scheduler efficiency: high cluster utilization (>80%) with fair‑share compliance and reduced wait times for priority queues.
- Change reliability: zero unplanned outages during driver/firmware upgrades.
- MTTR/MTBF improvements and SLA adherence.
Date: 20-03-2026
Dr. D. Premachandra Sagar
Pro Chancellor, DSU