ALOIS Solutions
Website: aloissolutions.com
Job details:
Job Title: Principal Backend Engineer
Location: India (Remote)
Job type: Full-time - Permanent
ROLE OVERVIEW
We are seeking a Principal Backend Engineer to own the scalability, reliability, and architectural evolution of an AI platform as it transitions from a prototype to a production-grade system. This is a high-impact, high-ownership role at the intersection of distributed systems engineering and applied AI.
KEY RESPONSIBILITIES
- Platform Architecture & Scalability: Own end-to-end backend architecture for the AI platform; design for multi-tenant, high-throughput extraction workloads.
- Extraction Pipeline Engineering: Architect and evolve AI pipelines; define pipeline abstractions, registry patterns, and execution strategies that balance accuracy, latency, and LLM API cost.
- Multi-Provider LLM Abstraction: Enhance and harden the unified LLM provider layer; implement multi-key round-robin pooling, structured output schemas, tool-calling protocols, and provider failover logic.
- Async Worker & Queue Architecture: Design the distributed worker system for heavy pipeline execution, including an observable queue with dead-letter handling, retry policies, and backpressure management.
- AWS Production Deployment: Lead the AWS deployment architecture (ECS/EC2, RDS PostgreSQL, S3, Bedrock, ALB, Secrets Manager, CloudWatch); define IaC, blue-green deployment strategies, and ensure the platform meets security, compliance, and data residency requirements.
- API Design & Governance: Define and enforce API design standards (OpenAPI 3.1, spec-first, versioning, deprecation); own the 25+ FastAPI endpoints, request/response schema evolution, and backward compatibility guarantees.
- Data Architecture: Own the PostgreSQL schema for extraction metadata (holes, dimensions, GD&T, title blocks, notes, extraction traces, batches); design indexing strategies, multi-tenant isolation, and efficient querying patterns for the document review and dataset export workflows.
- Security & Compliance: Implement secure service-to-service communication, secrets management via AWS Secrets Manager, IAM role-based access for Bedrock and S3, and CORS/auth policies.
- Observability & Reliability: Instrument the platform with structured logging, distributed tracing (AWS X-Ray / OpenTelemetry), and CloudWatch alarms; define SLOs for pipeline throughput, LLM call latency, and extraction accuracy.
- Engineering Leadership: Mentor senior engineers, conduct architecture and design reviews, set coding standards, and drive the technical roadmap for the enterprise.
REQUIRED SKILLS & EXPERIENCE
- 10+ years of backend engineering experience with a strong focus on distributed systems and production-grade platform design.
- Expert-level Python proficiency; deep hands-on experience with FastAPI, SQLAlchemy, Pydantic v2, and async/concurrent programming (asyncio, ThreadPoolExecutor).
- Proven experience designing and operating microservices at scale — including service decomposition, inter-service communication, and failure isolation strategies.
- Strong PostgreSQL expertise: schema design, indexing, query optimisation, multi-tenant isolation (RLS or schema-per-tenant), and migration management.
- Hands-on AWS production experience: EC2/ECS, RDS, S3, ALB, IAM, Secrets Manager, CloudWatch, and ideally Amazon Bedrock.
- Deep understanding of event-driven and async architectures: message queues, polling workers, idempotency guarantees, retry strategies, and backpressure handling.
- Experience integrating LLM APIs (any of: Anthropic Claude, OpenAI, Google Gemini, Mistral, AWS Bedrock) in production — including rate limiting, structured output enforcement, and multi-provider failover.
- Experience with container-based deployments using Docker and Docker Compose; familiarity with ECS task definitions and service orchestration.
- Strong API lifecycle management skills: OpenAPI 3.1, versioning, backward compatibility, deprecation policies, and governance frameworks.
- Practical knowledge of distributed system patterns: sagas, circuit breakers, idempotency keys, at-least-once delivery, and graceful degradation.
- Security engineering fundamentals: OAuth2, JWT, RBAC, CORS, secrets management, and cloud IAM role design.
- Prior experience building developer platforms, internal tooling, or AI/ML serving platforms at large-scale organisations.
PREFERRED / GOOD TO HAVE
- Direct experience building AI/LLM extraction pipelines — structured output, tool-calling, multi-agent orchestration (LangGraph, CrewAI, or custom frameworks).
- Familiarity with agentic pipeline patterns: planner-executor swarms, multi-round tool-use loops, shared notepad patterns, and confidence-based validation.
- Experience with Amazon Bedrock cross-region inference and IAM-based model access (Amazon Nova, Claude via Bedrock).
- Knowledge of ML model serving infrastructure: HuggingFace Transformers, PyTorch model loading, YOLO/object detection integration, ChromaDB vector stores.
- Experience with PDF processing pipelines: rasterisation (PyMuPDF), coordinate normalisation, multi-page extraction, and bounding box annotation workflows.
- Familiarity with service mesh (Istio/Linkerd), Kubernetes, or EKS for cloud-native deployment evolution.
- Experience with workflow orchestration frameworks (Temporal, Airflow, or equivalent) for long-running, stateful extraction jobs.
- Experience in high-scale SaaS platforms with strict SLAs, multi-region deployments, and enterprise security requirements.
SUMMARY
This role is for a Principal-level engineer who can own backend architecture at the intersection of large-scale distributed systems and applied AI. The ideal candidate has built production platforms that serve real enterprise workloads, has deep LLM integration experience, and can lead a team while staying hands-on in the most critical technical decisions.