Website:
larsentoubrovyoma.com
Job details:
Role Overview
We are seeking a seasoned Project Director – HPC / AI Infrastructure Deployment to lead large-scale, high-density compute programs involving GPU clusters, HPC workloads, and AI infrastructure. The role demands end-to-end ownership of deploying 10+ MW IT load data center environments, ensuring delivery of high-performance GPU-based compute platforms with cutting-edge networking and storage architectures.
Roles & Responsibilities
- Lead and deliver large-scale HPC / AI GPU cluster deployments (e.g., NVIDIA B200 / B300 GPU platforms) within defined timelines and budgets
- Drive execution of AI stack deployment (e.g., NVIDIA AI Enterprise, NVAIE) across hybrid, cloud, and on-prem environments
- Manage multi-vendor ecosystems including OEMs, SI partners, and hyperscale technology providers
- Deploy and scale high-density GPU racks with liquid-cooled and air-cooled thermal strategies
- Design and oversee InfiniBand (IB) and high-speed Ethernet networks
- Experience with NVIDIA/Mellanox InfiniBand fabrics
- Configuration and optimization using UFM (Unified Fabric Manager)
- Strong understanding of BCM (Broadcom Ethernet switching) platforms
- Architect and implement leaf-spine network topologies for ultra-low-latency AI workloads
- Ensure effective integration of storage systems (parallel file systems, NVMe-based storage)
- Oversee deployment of Kubernetes-based GPU orchestration platforms
- Experience with containerized AI workloads and distributed training clusters
- Exposure to NVIDIA AI Enterprise (NVAIE), CUDA, and GPU virtualization frameworks
- Manage data center design, build, and repurposing for HPC workloads
- Oversee MEP (Mechanical, Electrical, Plumbing) systems implementation
- Ensure optimized thermal management (liquid cooling, rear door heat exchangers, immersion cooling where applicable)
- Ensure optimized power density (kW/rack) planning
- Ensure optimized energy efficiency (PUE optimization)
- Establish robust governance frameworks aligned to:
a. HLD/LLD design validation
b. SOP adherence
c. Quality assurance benchmarks
- Implement risk mitigation strategies for large-scale deployments (supply chain, OEM dependencies, technology integration risks)
- Monitor program milestones and ensure SLA-based deliveries
- Drive structured cabling design (fiber-heavy HPC fabric, leaf-spine connectivity)
Qualifications & Experience
- B.E/B.Tech in Electrical / Electronics / Computer Science Engineering
- 15–25 years of experience in data center infrastructure deployment, HPC / AI workload environments, and large-scale IT infrastructure programs
Mandatory / Preferred Certifications
- PMP / PRINCE2 (mandatory for program governance)
- CDCP / CDCS / CDCPM certifications
Strongly preferred:
- NVIDIA AI Infrastructure / DGX / AI Factory certifications
- OEM certifications (Dell, HPE, Lenovo HPC systems)