HPC Cloud Engineer

h3 Technologies

Location: Mumbai, Maharashtra, India
Job type: Full-time

Required skills

Python
AWS
API
Azure
Bash
CUDA
Docker
end-to-end
GCP
GPU
Kubernetes
Node
Slurm Prolog/Epilog hooks
TensorFlow
Terraform
PyTorch
Vertex AI

About the role

Website: h3-technologies.com
Job details:

We're looking for an HPC Cloud Engineer to build, optimize, and operate large-scale GPU clusters for ML training and inference workloads across cloud and hybrid environments.

Experience: at least 6-7 years
Location: HO, Mumbai
Budget: 25 LPA

Core Skills

Cluster Infrastructure

Plan and manage HPC cluster environments for large-scale distributed ML training and inference
Configure high-bandwidth interconnects (InfiniBand, EFA, RoCE) and tune NCCL/MPI for low-latency GPU-to-GPU communication
Manage GPU compute fleets across on-demand, Spot, and Reserved capacity with cost-aware scaling policies

Slurm Administration

Design and tune Slurm partitions, QOS policies, fairshare trees, and priority weights for multi-team environments
Build job submission templates, Prolog/Epilog hooks, and REST API integrations
Troubleshoot scheduling bottlenecks, daemon failures, and node health issues

ML Training Environments

Build GPU-optimized images with CUDA, cuDNN, NCCL, and PyTorch/JAX/TensorFlow stacks
Set up parallel file systems (Lustre, GPFS, BeeGFS) for high-throughput dataset pipelines
Run all-reduce/all-gather benchmarks and tune cluster parameters for maximum training throughput

LLM Inference Infrastructure

Deploy and optimize inference serving stacks (vLLM, TensorRT-LLM, Triton Inference Server) for large-scale model serving
Implement tensor parallelism, pipeline parallelism, and continuous batching strategies to meet throughput and latency SLAs
Apply quantization techniques (INT8, FP8, AWQ) and other inference optimizations to maximize GPU efficiency
Plan capacity for inference fleets and autoscale them based on traffic patterns and cost targets

Automation & Observability

Manage infrastructure via Terraform or IaC tooling; write Python/Bash lifecycle automation scripts
Deploy Prometheus, Grafana, and cloud-native monitoring for GPU utilization, cluster health, and inference latency metrics
Enforce security best practices: least-privilege IAM, network segmentation, and encryption at rest and in transit

Bonus Skills

Experience with managed ML platforms (SageMaker, Vertex AI, Azure ML) for end-to-end MLOps workflows
Container-based HPC workflows using Kubernetes, Docker, and NVIDIA containers
Cost optimization: Spot/preemptible instance strategies, cluster showback reporting, instance right-sizing
Familiarity with multiple cloud providers (AWS, GCP, Azure) or on-prem/hybrid HPC environments
Speculative decoding, paged attention, and MoE expert parallelism for advanced inference optimization

Click on Apply to know more.