Website:
recrew.ai
Job details:
Role: Agentic Platform Architect – Shop OS & Runtime
Function: AI Infrastructure / Backend Engineering / Distributed Systems
Location: Bangalore
Type: Full-time
Industry: AI / Agentic Systems / Commerce / Payments / Logistics / SMB Technology
About Company
The company is building a foundational AI agent platform at India scale. It powers real-world outcomes for SMBs, SMEs, and MSMEs across commerce, payments, and logistics.
The platform interprets intent through voice and context, autonomously matching demand with fulfilment. It executes outcomes across physical and digital systems, eliminating dashboards and operational friction.
Backed by large-scale national infrastructure and distribution, this is a 0→1 build with startup intensity. It is a once-in-a-generation opportunity to bring millions of Indian businesses into the AI-driven economy.
Position Overview
We are looking for Agentic Platform Architects to design and build the core runtime powering AI agents within Shop OS. This role sits at the intersection of distributed systems and AI infrastructure, focused on how agents execute, coordinate, and scale reliably across millions of shops. You will own the stability, observability, and cost-efficiency of production-grade agentic systems, not prototypes.
Role & Responsibilities
- Design and build the core agent execution runtime powering Shop OS at India scale
- Develop hybrid workflows combining deterministic logic with LLM-driven decision-making
- Optimize inference orchestration pipelines for cost, latency, and throughput at scale
- Architect event-driven, fault-tolerant systems capable of serving millions of concurrent users
- Build deep observability layers (tracing, alerting, and monitoring) for agentic workflows
- Implement security, compliance, and data integrity practices across the agent runtime
- Define system architecture standards and make durable infrastructure decisions with long-term consequences
Must Have Criteria
- 7–12 years of backend engineering experience, including 3+ years building distributed systems in production
- Hands-on experience building and deploying LLM-based systems or AI agent infrastructure in production (not POCs)
- Proficiency in Java or Python for building high-throughput, low-latency backend services
- Experience with event-driven architectures using Apache Kafka at production scale
- Experience with workflow orchestration engines such as Temporal for managing complex, long-running processes
- Demonstrated experience building systems at India scale: 10M+ users or equivalent transaction volume
- Strong observability engineering experience: distributed tracing, metrics pipelines, and alerting systems (e.g., OpenTelemetry, Prometheus, Grafana)
Nice to Have
- Experience with model serving infrastructure (e.g., Triton, vLLM, or Ray Serve) for optimizing LLM inference costs
- Hands-on experience with vector databases (e.g., Weaviate, Pinecone, or pgvector) and RAG pipeline design
- Open-source contributions to distributed systems, AI infrastructure, or developer tooling
- Prior experience at a high-scale Indian tech company (e.g., Flipkart, Meesho, PhonePe, Razorpay, or similar)
- Experience building agentic systems with tool-calling, multi-agent coordination, or autonomous workflow execution
What We Offer
- End-to-end ownership of foundational AI infrastructure with national-scale impact
- Work on cutting-edge agentic systems backed by the company's distribution and infrastructure
- Small, high-ownership teams with minimal hierarchy and direct access to decision-making
- Rare opportunity to define core architectures at the 0→1 stage of a platform used by millions
- Competitive compensation, along with the scope and responsibility of a senior technical leader