Software Engineer, Data Acquisition

Location

India

Job type

Full-time

About the job

This job is sourced from a job board.

Overview

About the role

Nucleus AI

Website: withnucleus.ai
Job details:

At Nucleus, we believe the quality of intelligent systems is inseparable from the quality of the data that shapes them.

We’re hiring a Software Engineer, Data Acquisition to build the pipelines and systems that acquire, process, and curate data at scale for model training and research. This role sits at the foundation of our AI stack: designing the machinery that turns diverse, messy, real-world inputs into reliable, high-quality datasets that accelerate frontier research and production systems.

You’ll work across ingestion, processing, quality controls, enrichment, and data lifecycle management. The scope is both deeply technical and highly consequential—your systems will influence how quickly Nucleus can learn, iterate, and improve.

In this role, you will

Design and build large-scale data acquisition pipelines for text, image, audio, video, and other structured or unstructured sources.
Develop systems for data ingestion, normalization, validation, deduplication, and transformation across a wide range of formats and providers.
Improve the quality, freshness, and coverage of training and research datasets through robust automation and monitoring.
Build tooling and workflows for dataset curation, filtering, metadata enrichment, and provenance tracking.
Partner with research, infrastructure, and safety teams to ensure acquired data is useful, compliant, and aligned with downstream training goals.
Optimize throughput, reliability, and cost across high-volume acquisition and processing pipelines.
Establish quality checks and observability systems that surface dataset issues early and make debugging easier at scale.

You may be a good fit if you

Have strong software engineering fundamentals and experience building reliable backend or data-intensive systems in production.
Have worked on distributed data pipelines, ETL systems, crawlers, ingestion services, or large-scale data processing platforms.
Are comfortable with Python, Go, Java, Rust, or similar languages used in systems and data engineering.
Understand how to design for scale, fault tolerance, reproducibility, and operational simplicity.
Care deeply about data quality and enjoy turning ambiguous, real-world inputs into dependable building blocks for research and products.
Communicate clearly across engineering and research functions, and can balance speed with rigor.

What makes Nucleus different

Nucleus is building large-scale intelligent systems that require exceptional foundations—across models, infrastructure, and data. Here, data acquisition is not a support function; it is strategic technical work that directly shapes what our systems can learn and do. You’ll join a team that values depth, craftsmanship, and thoughtful execution in service of ambitious goals.

If you’re excited to build the data engines behind frontier AI, we’d love to hear from you.

Skills

Python

backend

data ingestion

ETL

Java

Rust