Voice AI Engineer

Kratos Gamer Network

full-time

Required skills

Python
AWS
API
Docker
FastAPI
FFmpeg
gaming
Gate
GCS
Session Management

About the role

Website: kgen.io
Job details:

About KGeN

KGeN is building the Verified Distribution Protocol (VeriFi) for AI, DeFi, and Gaming — built on real users and real commerce to accelerate growth for projects across these industries.

Since its founding by global leaders in the consumer and gaming sectors, KGeN has grown to become the dominant growth engine in the Global South.

With 45.7 million users, 6.7 million monthly active users, and $64 million in annualized revenue, KGeN delivers verified user acquisition, on-chain loyalty programs, and decentralized storefronts via its POGE identity and reputation framework and a global clan network spanning more than 60 countries.

We build structured, high-signal voice datasets for frontier AI labs — with a specific focus on the Global South: Indic, African, and Southeast Asian languages that current speech AI dramatically underserves. Your job is to own the engineering layer that makes that possible — from raw audio collection through annotation, automated QC, and delivery. You'll build proprietary tooling on top of existing models, not just integrate them. The pipeline you build directly determines the quality signal that reaches model teams at frontier labs — for the languages that need it most.

Raw data collection infrastructure

Design and build collection pipelines that capture audio with structured metadata — recorder configs, environment tagging, speaker demographics, session management. Engineer ingestion systems that handle diverse sources (studio, field, crowd-sourced, synthetic) with consistent schema from day one.

Annotation pipeline engineering

Build and own the tooling stack for forced alignment, speaker diarisation, utterance segmentation, and transcript normalisation. Integrate and fine-tune open models (Whisper, WhisperX, pyannote) as pipeline components — not endpoints. Build the human-in-the-loop interface layer that connects automated outputs to annotation workflows.

Automated QC and rejection systems

Design scoring systems that gate data quality before it reaches annotation or delivery — SNR thresholds, clipping detection, annotation consistency checks, accent/dialect coverage tracking. Build supplier validation pipelines with objective rejection signals, not just flags.
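As an illustration of the kind of gate described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example rather than KGeN's actual tooling: the names (`qc_gate`, `QCResult`), the thresholds (15 dB, 0.1% clipped samples), and the crude SNR estimate that treats the quietest frames as the noise floor. Samples are assumed to be floats in [-1.0, 1.0].

```python
import math
from dataclasses import dataclass


@dataclass
class QCResult:
    snr_db: float
    clipped_fraction: float
    accepted: bool
    reasons: list  # machine-readable rejection signals, empty if accepted


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def estimate_snr_db(samples, frame_len=400, noise_quantile=0.1):
    """Crude SNR estimate: the quietest frames stand in for the noise floor.

    This is an illustrative heuristic, not a calibrated measurement.
    """
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("clip is shorter than one frame")
    n_noise = max(1, int(len(energies) * noise_quantile))
    noise = sum(energies[:n_noise]) / n_noise
    signal = sum(energies) / len(energies)
    eps = 1e-12  # avoids log(0) and division by zero on digital silence
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def clipped_fraction(samples, ceiling=0.999):
    """Share of samples at or beyond the assumed full-scale ceiling."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= ceiling) / len(samples)


def qc_gate(samples, min_snr_db=15.0, max_clipped=0.001):
    """Return a reject/accept decision with explicit reasons, not just a flag."""
    snr = estimate_snr_db(samples)
    clip = clipped_fraction(samples)
    reasons = []
    if snr < min_snr_db:
        reasons.append("low_snr")
    if clip > max_clipped:
        reasons.append("clipping")
    return QCResult(snr, clip, not reasons, reasons)
```

Returning the list of reasons alongside the decision is what separates an objective rejection signal from a flag: a supplier can be told exactly which check a clip failed and by how much.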

Proprietary tooling on top of foundation models

Understand what ASR/TTS models do well and where they fail — especially for low-resource, tonal, and agglutinative Global South languages. Build internal tooling that extends, corrects, and orchestrates them: custom post-processing, domain-specific normalisation, error-pattern detectors, and model-ensemble logic.

Global South & Indic language support

Engineer annotation and evaluation tooling built specifically for Global South languages — Indic (Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Gujarati), African (Swahili, Amharic, Yoruba, Hausa), and Southeast Asian (Bahasa, Tagalog, Vietnamese). Handle code-switching, transliteration, script-level normalisation, tonal variation, and low-resource dialect segments with purpose-built logic, not generic fallbacks. These languages are not edge cases — they are the mission.

Evaluation and benchmarking

Implement WER, CER, MOS, SNR, and latency metrics grounded in what downstream models actually need to learn — with specific attention to failure modes that are unique to Global South language phonology, morphology, and script systems.
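For readers less familiar with the first two metrics, a minimal sketch of WER and CER follows. Both are an edit distance divided by the reference length. The function names are hypothetical, and two choices here are assumptions for the example: whitespace tokenisation for words, and NFC-normalised Unicode code points for characters. For Indic scripts, counting grapheme clusters instead of code points can give noticeably different CER values, which is one of the script-level failure modes mentioned above.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference has no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over NFC-normalised code points (an assumption)."""
    ref_chars = list(unicodedata.normalize("NFC", reference))
    hyp_chars = list(unicodedata.normalize("NFC", hypothesis))
    if not ref_chars:
        raise ValueError("reference is empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.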

Data versioning and lineage

Build auditable lineage tracking so that every dataset iteration is reproducible: what changed, why, and what it produced. Every experiment should be re-runnable with a single command.

Must Have

2–5 years of engineering experience in speech AI, audio ML, or ML data infrastructure — with direct ownership of at least one raw-to-labeled audio pipeline that a model team trained on.
Hands-on experience with ASR integration and fine-tuning — Whisper, WhisperX, SpeechBrain, Kaldi, or equivalent. You know where they break, not just how to call them.
Fluency with audio data tooling at scale — FFmpeg, librosa, torchaudio, pyannote, soundfile.
Strong Python — audio processing, API integration, pipeline orchestration.
Proven ability to design automated QC systems with objective scoring signals, not just rule-based filters.
Experience shipping reproducible ML environments — Docker, CI/CD, experiment tracking.
Genuine interest in linguistic diversity, underrepresented languages, and closing the Global South speech AI gap.

Nice to Have

MLflow, W&B, or similar experiment tracking at scale.
Cloud audio storage and streaming — AWS S3, GCS — at volume.
FastAPI or similar for exposing evaluation or annotation endpoints.
Background in TTS systems or multimodal audio-visual data.
Contributions to open-source speech or audio tooling.
