InfiReex
Website: infireex.com
Job details:
We are looking for a skilled AI Platform Engineer to build, manage, and enhance AI/ML infrastructure, workflows, and automation pipelines. The role involves creating scalable platforms for training and deploying machine learning models using modern orchestration, automation, and GPU acceleration technologies. The ideal candidate will work closely with data scientists and platform engineering teams to enable efficient resource management and scalable operations across cloud and hybrid ecosystems.
Key Responsibilities
AI/ML Infrastructure & Kubernetes
- Design, deploy, and manage Kubernetes environments optimized for AI/ML applications and workloads.
- Ensure scalability, reliability, and performance of containerized AI platforms.
GPU Resource Management
- Implement and manage GPU orchestration solutions such as Run:ai and related operators for workload scheduling and resource optimization.
- Enable efficient GPU allocation and utilization for AI model training and inference.
Automation & Pipeline Development
- Build and maintain Python-based automation tools and machine learning pipelines.
- Automate infrastructure deployment using Terraform and manage configurations through Ansible.
Notebook Environment & Collaboration
- Develop and maintain Jupyter Notebook environments to support experimentation, research, and collaborative model development.
NVIDIA Ecosystem Integration
- Configure and optimize NVIDIA AI Enterprise technologies, including CUDA, the NeMo Framework, Triton Inference Server, TensorRT, and GPU drivers, to support accelerated AI computing.
MLOps & Lifecycle Management
- Implement MLOps standards and practices covering model lifecycle management, CI/CD pipelines, monitoring, and governance using tools such as MLflow and Kubeflow.
Cross-functional Collaboration
- Partner with data scientists, ML engineers, and platform teams to improve scalability, operational efficiency, and resource utilization across cloud and hybrid infrastructures.
Required Skills & Experience
- Strong programming expertise in Python with hands-on experience using ML frameworks such as TensorFlow and PyTorch.
- Practical experience with Kubernetes and container orchestration technologies.
- Familiarity with Run:ai or equivalent GPU workload scheduling platforms.
- Strong experience in infrastructure automation using Terraform and configuration management using Ansible.
- Experience working with Jupyter Notebooks in AI/ML development environments.
- Good understanding of NVIDIA AI Enterprise technologies, including CUDA, the NeMo Framework, Triton Inference Server, and GPU drivers.
- Knowledge of MLOps concepts, workflows, and tools such as MLflow and Kubeflow.
- Experience deploying, managing, and scaling AI/ML workloads within cloud or hybrid infrastructure environments.