Website:
talent500.com
Job details:
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
Engineer, AI (Support/Operations) at T-Mobile monitors, maintains, and optimizes production AI agents and enterprise AI systems to ensure high availability, performance, and reliability across the organization. This role involves supporting deployed AI solutions, including LLM-based applications, agentic workflows, and Retrieval-Augmented Generation (RAG) pipelines, through incident response, troubleshooting, performance tuning, and operational monitoring. Working collaboratively with development teams and business stakeholders, the Engineer ensures AI systems remain healthy, secure, and continuously improved, directly contributing to T-Mobile’s mission of operational excellence, efficiency, and customer-first transformation.
MAIN RESPONSIBILITIES:
- Monitor and maintain production AI systems to ensure high availability, optimal performance, and reliable service delivery across enterprise workflows.
- Triage and resolve incidents related to AI agent failures, model degradation, API errors, and integration issues through systematic root cause analysis.
- Manage AI system configurations, connector integrations, deployment pipelines, and access controls across enterprise AI platforms.
- Collaborate with AI development teams and business stakeholders to support deployment readiness, user onboarding, and escalation workflows.
- Develop and maintain operational runbooks, monitoring dashboards, alerting rules, and documentation for AI systems and agent workflows.
- Identify and implement operational improvements, including automation of repetitive support tasks, capacity planning, and SLA tracking for AI services.
- Perform other duties and projects as assigned by business management.
- Bachelor's Degree plus 3 years of related work experience OR advanced degree with 1 year of related work experience OR combination of education and experience deemed equivalent
- 2-4 years of experience in the skills and responsibilities listed below:
- Supporting and troubleshooting production AI/ML systems, including monitoring, alerting, and incident response
- Experience with enterprise AI platform administration, configuration management, and operational tooling using Python or similar languages
- Coordinating with development and business teams to manage releases, escalations, and operational handoffs for AI solutions
- Ability to analyze system logs, metrics, and usage data to identify trends, anomalies, and optimization opportunities.
The following knowledge and skills are required to perform this role:
- Expertise in triaging, diagnosing, and resolving production incidents for AI systems and agentic workflows.
- Strong knowledge of observability, logging, alerting, and performance monitoring for AI and RAG pipelines
- APIs & Tools: Experience with LLM APIs, enterprise AI platforms, connector integrations, and operational tooling (ServiceNow, Jira, etc.)
- Agile Methodologies: Experience in agile project management to facilitate effective incident response and operational improvement cycles
- Ability to align AI operations practices with enterprise reliability and service-level goals.
- Ability to identify and implement operational improvements and automation for AI support processes.
- Focus on delivering reliable AI-driven services that meet internal customer expectations and SLAs.
- Skill in managing and administering enterprise AI platforms, environments, and operational infrastructure.
- Strong skills in diagnosing and resolving production AI system issues through systematic troubleshooting.
- 2-4 years of experience in AI/ML operations, IT support engineering, or production systems support.
- Experience with LLM-based applications, AI agent platforms, API integrations, and monitoring/observability tools.
- Python scripting for automation (Nice to have)
- Focus on "Building Agents"
- The focus is on operations and continuous improvement of products (for example, the Tess IT Help bot and the Fetch Myhr bot). This includes agent evaluation, production monitoring (the largest share of the work), incident triage, and agent debugging.
TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. This means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.
TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate’s acceptance of a formal offer.