Principal Backend Engineer

ALOIS Solutions

full-time

Required skills

Python
Airflow
AWS
API
backend
CloudWatch
compliance
Docker
EC2
ECS
end-to-end
FastAPI
Kubernetes
microservices
multi-tenant
OAuth2
PostgreSQL
SaaS
SQLAlchemy
PyTorch

About the role

Website: aloissolutions.com
Job details:

Job Title: Principal Backend Engineer

Location: India (Remote)

Job type: Full-time - Permanent

ROLE OVERVIEW

We are seeking a Principal Backend Engineer to own the scalability, reliability, and architectural evolution of an AI platform as it moves from prototype to production-grade system. This is a high-impact, high-ownership role at the intersection of distributed systems engineering and applied AI.

KEY RESPONSIBILITIES

Platform Architecture & Scalability: Own end-to-end backend architecture for the AI platform; design for multi-tenant, high-throughput extraction workloads.
Extraction Pipeline Engineering: Architect and evolve AI pipelines; define pipeline abstractions, registry patterns, and execution strategies that balance accuracy, latency, and LLM API cost.
Multi-Provider LLM Abstraction: Enhance and harden the unified LLM provider layer; implement multi-key round-robin pooling, structured output schemas, tool-calling protocols, and provider failover logic.
Async Worker & Queue Architecture: Design the distributed worker system for heavy pipeline execution, including an observable queue with dead-letter handling, retry policies, and backpressure management.
AWS Production Deployment: Lead the AWS deployment architecture (ECS/EC2, RDS PostgreSQL, S3, Bedrock, ALB, Secrets Manager, CloudWatch); define IaC and blue-green deployment strategies; and ensure the platform meets security, compliance, and data residency requirements.
API Design & Governance: Define and enforce API design standards (OpenAPI 3.1, spec-first, versioning, deprecation); own the 25+ FastAPI endpoints, request/response schema evolution, and backward compatibility guarantees.
Data Architecture: Own the PostgreSQL schema for extraction metadata (holes, dimensions, GD&T, title blocks, notes, extraction traces, batches); design indexing strategies, multi-tenant isolation, and efficient querying patterns for the document review and dataset export workflows.
Security & Compliance: Implement secure service-to-service communication, secrets management via AWS Secrets Manager, IAM role-based access for Bedrock and S3, and CORS/auth policies.
Observability & Reliability: Instrument the platform with structured logging, distributed tracing (AWS X-Ray / OpenTelemetry), and CloudWatch alarms; define SLOs for pipeline throughput, LLM call latency, and extraction accuracy.
Engineering Leadership: Mentor senior engineers, conduct architecture and design reviews, set coding standards, and drive the technical roadmap for the enterprise.

REQUIRED SKILLS & EXPERIENCE

10+ years of backend engineering experience with a strong focus on distributed systems and production-grade platform design.
Expert-level Python proficiency; deep hands-on experience with FastAPI, SQLAlchemy, Pydantic v2, and async/concurrent programming (asyncio, ThreadPoolExecutor).
Proven experience designing and operating microservices at scale — including service decomposition, inter-service communication, and failure isolation strategies.
Strong PostgreSQL expertise: schema design, indexing, query optimisation, multi-tenant isolation (RLS or schema-per-tenant), and migration management.
Hands-on AWS production experience: EC2/ECS, RDS, S3, ALB, IAM, Secrets Manager, CloudWatch, and ideally Amazon Bedrock.
Deep understanding of event-driven and async architectures: message queues, polling workers, idempotency guarantees, retry strategies, and backpressure handling.
Experience integrating LLM APIs (any of: Anthropic Claude, OpenAI, Google Gemini, Mistral, AWS Bedrock) in production — including rate limiting, structured output enforcement, and multi-provider failover.
Experience with container-based deployments using Docker and Docker Compose; familiarity with ECS task definitions and service orchestration.
Strong API lifecycle management skills: OpenAPI 3.1, versioning, backward compatibility, deprecation policies, and governance frameworks.
Practical knowledge of distributed system patterns: sagas, circuit breakers, idempotency keys, at-least-once delivery, and graceful degradation.
Security engineering fundamentals: OAuth2, JWT, RBAC, CORS, secrets management, and cloud IAM role design.
Prior experience building developer platforms, internal tooling, or AI/ML serving platforms at large-scale organisations.

PREFERRED / GOOD TO HAVE

Direct experience building AI/LLM extraction pipelines — structured output, tool-calling, multi-agent orchestration (LangGraph, CrewAI, or custom frameworks).
Familiarity with agentic pipeline patterns: planner-executor swarms, multi-round tool-use loops, shared notepad patterns, and confidence-based validation.
Experience with Amazon Bedrock cross-region inference and IAM-based model access (Amazon Nova, Claude via Bedrock).
Knowledge of ML model serving infrastructure: HuggingFace Transformers, PyTorch model loading, YOLO/object detection integration, ChromaDB vector stores.
Experience with PDF processing pipelines: rasterisation (PyMuPDF), coordinate normalisation, multi-page extraction, and bounding box annotation workflows.
Familiarity with service mesh (Istio/Linkerd), Kubernetes, or EKS for cloud-native deployment evolution.
Experience with workflow orchestration frameworks (Temporal, Airflow, or equivalent) for long-running, stateful extraction jobs.
Experience in high-scale SaaS platforms with strict SLAs, multi-region deployments, and enterprise security requirements.

SUMMARY

This role is for a Principal-level engineer who can own backend architecture at the intersection of large-scale distributed systems and applied AI. The ideal candidate has built production platforms that serve real enterprise workloads, has deep LLM integration experience, and can lead a team while staying hands-on in the most critical technical decisions.

