ML Platform/Infrastructure Engineer

Min Experience

5 years

Location

remote

Job Type

full-time

About the job

This job is sourced from a job board.

About the role

Observo AI has pioneered an AI-powered telemetry data pipeline that can extract the most important security data from any source, parse and transform it into the right format, automatically detect and mask sensitive data, and route it to the analytics platform with the most value. By reducing noisy data volume by 80% or more, we can typically help reduce the total cost of security by as much as 50%. We shift analytics "left" into the telemetry stream to surface anomalies before all of the data is indexed in a SIEM or logging tool so DevOps and Security teams can focus on the most critical incidents before they spiral into costly problems. This helps teams detect and resolve critical incidents 40% faster. Join our team to make an impact with a fast-growing, innovative company committed to driving success for our customers.

Position Overview

Observo AI is looking for a talented and experienced Senior Software Engineer – ML Infrastructure to build the foundation that powers our next-generation machine learning systems. In this role, you will design and maintain the infrastructure, tooling, and systems that enable our ML engineers to iterate faster, scale reliably, and deploy models into production seamlessly. You'll work closely with ML engineers, backend engineers, and DevOps to ensure our ML platform is efficient, robust, and future-ready.

This role is ideal for someone who thrives at the intersection of infrastructure and machine learning, and who wants to build systems that power scalable and reliable AI at the core of our observability platform.

Key Responsibilities

Design, build, and maintain the core ML infrastructure, including training pipelines, feature stores, model registries, and model serving infrastructure.
Develop tools and platforms that streamline model training, evaluation, deployment, and monitoring at scale.
Collaborate with ML Engineers and DevOps to establish CI/CD workflows for ML models, including validation, versioning, and rollout strategies.
Optimize performance and reliability of large-scale distributed data processing and model inference systems.
Establish observability and tracing for ML pipelines to track data drift, model performance degradation, and training anomalies.
Contribute to security and compliance best practices across ML workflows, including access control and auditability.
Drive technical decisions around tooling, frameworks, and architecture for our ML platform.
Hands-on experience working with LLMs, including scaling third-party hosted inference services (e.g., OpenAI, Cohere, Anthropic) and building fault-tolerant, production-grade workflows around them.

Qualifications

5+ years of experience in software engineering, with at least 2 years focused on ML infrastructure, MLOps, or related systems.
Strong programming skills in Python, Go, or Java, and experience with containerization (Docker, Kubernetes).
Deep familiarity with ML infrastructure tools like MLflow, Kubeflow, Metaflow, TFX, SageMaker, or Vertex AI.
Experience designing and running data pipelines using tools like Airflow, Prefect, or similar orchestration frameworks.
Hands-on experience with cloud platforms such as AWS, GCP, or Azure, especially in the context of ML workloads.
Solid understanding of CI/CD principles, especially as applied to machine learning workflows.

Skills

python
go
java
docker
kubernetes
mlflow
kubeflow
metaflow
tfx
sagemaker
vertex-ai
airflow
prefect
aws
gcp
azure