h3 Technologies
Website: h3-technologies.com
Job details:
We're looking for an HPC Cloud Engineer to build, optimize, and operate large-scale GPU clusters for ML training and inference workloads across cloud and hybrid environments.
Experience: at least 6-7 years
Location: HO Mumbai
Budget: 25 LPA
Core Skills
Cluster Infrastructure
- Plan and manage HPC cluster environments for large-scale distributed ML training and inference
- Configure high-bandwidth interconnects (InfiniBand, EFA, RoCE) and tune NCCL/MPI for low-latency GPU-to-GPU communication
- Manage GPU compute fleets across on-demand, Spot, and Reserved capacity with cost-aware scaling policies
Slurm Administration
- Design and tune Slurm partitions, QOS policies, fairshare trees, and priority weights for multi-team environments
- Build job submission templates, Prolog/Epilog hooks, and REST API integrations
- Troubleshoot scheduling bottlenecks, daemon failures, and node health issues
ML Training Environments
- Build GPU-optimized images with CUDA, cuDNN, NCCL, and PyTorch/JAX/TensorFlow stacks
- Set up parallel file systems (Lustre, GPFS, BeeGFS) for high-throughput dataset pipelines
- Run all-reduce/all-gather benchmarks and tune cluster parameters for maximum training throughput
LLM Inference Infrastructure
- Deploy and optimize inference serving stacks (vLLM, TensorRT-LLM, Triton Inference Server) for large-scale model serving
- Implement tensor parallelism, pipeline parallelism, and continuous batching strategies to meet throughput and latency SLAs
- Apply quantization techniques (INT8, FP8, AWQ) and other inference optimizations to maximize GPU efficiency
- Plan capacity for inference fleets and autoscale them based on traffic patterns and cost targets
Automation & Observability
- Manage infrastructure via Terraform or other IaC tooling; write Python/Bash lifecycle automation scripts
- Deploy Prometheus, Grafana, and cloud-native monitoring for GPU utilization, cluster health, and inference latency metrics
- Enforce security best practices: least-privilege IAM, network segmentation, and encryption at rest and in transit
Bonus Skills
- Experience with managed ML platforms (SageMaker, Vertex AI, Azure ML) for end-to-end MLOps workflows
- Container-based HPC workflows using Kubernetes, Docker, and NVIDIA containers
- Cost optimization: Spot/preemptible instance strategies, cluster showback reporting, instance right-sizing
- Familiarity with multiple cloud providers (AWS, GCP, Azure) or on-prem/hybrid HPC environments
- Speculative decoding, paged attention, and MoE expert parallelism for advanced inference optimization