Software Engineer, AI Training Infrastructure

Salary

$175k - $190k

Min Experience

3 years

Location

Redwood City, CA; New York, NY

Job Type

full-time

About the job

This job is sourced from a job board.

Overview

About the role

As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.

Key Responsibilities:

Design and implement scalable infrastructure for large-scale model training workloads

Develop and maintain distributed training pipelines for LLMs and multimodal models

Optimize training performance across multiple GPUs, nodes, and data centers

Implement monitoring, logging, and debugging tools for training operations

Architect and maintain data storage solutions for large-scale training datasets

Automate infrastructure provisioning, scaling, and orchestration for model training

Collaborate with researchers to implement and optimize training methodologies

Analyze and improve efficiency, scalability, and cost-effectiveness of training systems

Troubleshoot complex performance issues in distributed training environments

Minimum Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience

3+ years of experience with distributed systems and ML infrastructure

Experience with PyTorch

Proficiency in cloud platforms (AWS, GCP, Azure)

Experience with containerization and orchestration (Docker, Kubernetes)

Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)

Preferred Qualifications:

Master's or PhD in Computer Science or related field

Experience training large language models or multimodal AI systems

Experience with ML workflow orchestration tools

Background in optimizing high-performance distributed computing systems

Familiarity with ML DevOps practices

Contributions to open-source ML infrastructure or related projects

About the company

At Fireworks AI, we're building the infrastructure that powers the next generation of AI applications. From real-time inference to model optimization, our platform empowers developers and enterprises to deploy, scale, and innovate with cutting-edge AI—faster and smarter than ever before.

Skills

python

pytorch

kubernetes

docker

distributed systems

ml infrastructure

ml devops