Senior AI Platform Engineer

Lexsi Labs

Location: Bengaluru, Karnataka, India
Job type: Full-time

Required skills

Python
backend
cloud infrastructure

About the role

Website: lexsi.ai
Job details:

Lexsi Labs is a frontier AI lab building aligned, interpretable, and safe superintelligent systems. Our work spans alignment methods, interpretability-led system design, and foundational model research across LLMs, agents, and tabular / structured-data models.

A core part of our work is turning advanced AI research into production systems. That means taking internally developed libraries and model workflows, such as AlignTune, TabTune, and DLBacktrace, and integrating them into scalable platform infrastructure that supports training, inference, evaluation, observability, and enterprise deployment.

We are hiring a Senior AI Platform Engineer to build the platform layer that operationalizes Lexsi’s core AI systems.

This role is centered on R&D-to-platform rollout: taking technically sophisticated model systems and making them usable, reliable, and scalable inside the product stack. You will work across model pipelines, training systems, inference infrastructure, distributed execution, and backend platform architecture.

This is not a thin integration role. It requires strong engineering depth and enough model understanding to work effectively with systems involving fine-tuning, RL, alignment, interpretability, agent execution, and inference optimization across LLMs, agents, and tabular foundation models.

Responsibilities:

Own the platformization of Lexsi’s internal AI libraries, turning research-heavy systems into robust platform capabilities with stable APIs, execution layers, observability, and deployment paths.
Build and scale training and post-training infrastructure for workflows including SFT, RL, evaluation, model adaptation, and agent optimization.
Design the integration layer between research systems and product infrastructure, including job orchestration, artifact management, dataset versioning, experiment lineage, and runtime control surfaces.
Build inference systems that can support complex model behaviors under production constraints, including latency, throughput, cost efficiency, debuggability, and safety.
Design for multi-cluster and distributed execution, including scheduling, fault tolerance, checkpointing, retries, workload isolation, and heterogeneous compute environments.
Operationalize systems such as AlignTune for fine-tuning and RL pipelines, TabTune for tabular foundation model workflows, and DLBacktrace for interpretability, tracing, and behavioral inspection.
Build common platform primitives for model lifecycle management across training, evaluation, serving, rollback, and monitoring.
Partner closely with research teams to translate model-science complexity into production architecture without flattening away the core technical value.
Improve platform reliability for long-running and failure-prone AI workloads, especially where model behavior, system behavior, and infrastructure behavior interact in non-trivial ways.
Ensure that alignment, interpretability, and auditability are embedded into system design, especially for enterprise and regulated deployments where model outputs and decisions must be explainable.

Example Problems You Might Work On:

Turn AlignTune into a production-grade internal RL-as-a-service offering for supervised fine-tuning and reinforcement learning across multiple model families, datasets, and evaluation loops.
Build rollout infrastructure for new model-science capabilities so research systems can be exposed safely and incrementally inside the platform.
Integrate DLBacktrace into training and inference pipelines so model behavior can be traced, debugged, and surfaced through internal and external product surfaces.
Build inference architecture for large models and agent systems that must balance cost, performance, explainability, and runtime control.
Design distributed execution flows across clusters for long-running training, evaluation, and analysis workloads with strong guarantees around recovery and reproducibility.
Unify workflows across LLMs, agents, and tabular models without collapsing their distinct operational and scientific requirements into a one-size-fits-none abstraction.
Build the platform interfaces that let downstream teams launch, inspect, evaluate, and deploy complex model workflows without needing to reimplement research infrastructure.

Requirements:

Strong experience building and shipping complex AI / ML systems in production
Deep backend and platform engineering experience, especially in Python, distributed services, workflow orchestration, data systems, and cloud infrastructure
Hands-on experience with one or more of: fine-tuning systems, RL pipelines, inference infrastructure, distributed training, model serving, evaluation systems
Strong understanding of the systems implications of modern model workflows across LLMs, agents, and structured / tabular model systems
Experience scaling workloads across clusters and production environments, with strong instincts around reliability, observability, and performance
Ability to work across research code, systems code, and product infrastructure without losing rigor at any layer
Strong technical judgment around the tradeoffs between model quality, infra complexity, scalability, interpretability, and operational cost

Strong Bonus Signals:

Experience with alignment, interpretability, or AI safety systems
Experience with multi-cluster scheduling, inference optimization, or serving infrastructure for large models
Experience converting internal research frameworks into reusable platform capabilities
Experience debugging production failures caused by interactions between model behavior, orchestration systems, and infrastructure
Experience with agent runtimes, tool orchestration, long-horizon execution, or stateful model systems
