HPC Engineer - Storage
World Wide Technology
- Location
- India
- Job type
- Full-time
Required skills
- Ansible
- firmware
- K8s
- kernel
- Kubernetes
- Linux
About the role
World Wide Technology
Website:
wwt.com
Job details:
1. Storage Integration & Client Configuration
- Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
- Protocol Configuration: Configure and tune RDMA-based protocols and data paths (NVMe-oF, NFS over RDMA, GPUDirect Storage) so that data moves with minimal CPU involvement and, with GPUDirect Storage, directly into GPU memory.
- Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
- Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.
2. Validation & Performance Benchmarking
- Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400 GB/s read throughput).
- Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for small-file random I/O patterns common in checkpointing.
- Acceptance Reporting: Generate "As-Built" storage validation reports, documenting effective throughput and IOPS for client sign-off.
3. Operations & Support
- Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent "Disk Full" outages on critical scratch filesystems.
- Ticket Resolution: Handle L2 support tickets for storage issues, such as "Stale file handles," "Slow dataset loading," or "CSI Driver crashes."
- Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.
Technical Competencies
Essential Skills
High-Performance Storage:
- Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).
- Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.
- RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.
Automation & Containerisation:
- Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
- Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
- GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify whether GDS is active.
Desirable Experience
- Vendor Specifics: Certification in, or in-depth experience with, Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
- Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
- Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.
Certifications
Highly Desirable:
- NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
- Vendor Certifications:
  - VAST Certified Administrator (VCP-AD1)
  - WEKA Technical Xpert Certification
  - Red Hat Certified Specialist in Storage Administration
Success Metrics (KPIs)
- I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
- Mount Stability: Zero "Stale File Handles" or disconnected mounts across the cluster during the 72-hour burn-in period.
- Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.
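To illustrate the I/O Performance metric, here is a minimal sketch of the arithmetic, assuming measured throughput and theoretical line rate are both expressed in GB/s. The function names and the sample figures are hypothetical and are not taken from any WWT tooling or benchmark output.

```python
def line_rate_efficiency(measured_gb_s: float, line_rate_gb_s: float) -> float:
    """Return measured throughput as a percentage of the theoretical line rate."""
    if line_rate_gb_s <= 0:
        raise ValueError("line rate must be positive")
    return 100.0 * measured_gb_s / line_rate_gb_s


def meets_kpi(measured_gb_s: float, line_rate_gb_s: float,
              threshold_pct: float = 95.0) -> bool:
    """True when efficiency is strictly above the threshold (the KPI says >95%)."""
    return line_rate_efficiency(measured_gb_s, line_rate_gb_s) > threshold_pct


if __name__ == "__main__":
    # Hypothetical client: a 400 Gb/s link is 50 GB/s of theoretical line rate.
    line_rate = 400 / 8
    measured = 48.0  # illustrative benchmark result in GB/s, not real data
    print(f"efficiency: {line_rate_efficiency(measured, line_rate):.1f}%")
    print("KPI met" if meets_kpi(measured, line_rate) else "KPI missed")
```

The comparison is strict, so a client at exactly 95% would not count as meeting the target under this reading of the metric.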