AI Engineer II - Evaluation & Learning

Mechademy

Location: Gurugram, Haryana, India
Job type: Full-time

Required skills

Python
data science
Ray
regression
SQL
statistics
version control

About the role

Website: mechademy.com
Job details:

The Opportunity

Most ML roles ask you to build models. This one asks you to build the system that decides whether the models are actually working — in production, on live industrial equipment, where a wrong answer has real operational consequences.

Mechademy's AI agents diagnose equipment faults, recommend actions, and route critical events. The question nobody has rigorously answered yet is: are they getting better? You will own that answer. You will design the evaluation frameworks, build the RLHF data quality layer, and create the metrics that turn "we think the AI is improving" into "we can prove it, sprint over sprint." This is the measurement backbone the entire AI system depends on — and it does not exist yet. You are building it.

About Mechademy

Mechademy builds an enterprise AI platform for real-time monitoring, diagnostics, and predictive maintenance of industrial equipment. Under the hood: physics-informed ML, custom AutoML, distributed training with Ray, and production AI running on every deployment.

45+ employees across New Delhi and Houston
Venture-funded, closing Series A | 80–90% YoY growth
15+ enterprise clients across oil & gas, power generation, and LNG
Physics-informed ML + production LLM/VLM: a 12–18 month lead over the field

What You'll Own

Training Data Quality & Feedback Pipelines (35%)

You own the data quality layer that determines whether the AI learns the right things from expert feedback — designing how it's captured, cleaning it, structuring it, and measuring whether it's actually improving the system.

Design and maintain the labeling schema for domain engineer feedback: what gets annotated, how, and at what quality threshold
Build pipelines that capture, clean, and structure monitoring session reasoning traces as training data
Define inter-annotator agreement standards and audit label quality on a regular cadence
Partner with the engineering team building the feedback collection mechanism to ensure captured data is structured for training
Curate and maintain held-out evaluation sets that measure agent progress across sprints

Evaluation Framework Design (30%)

You define what "the agent is getting better" means — in numbers, not opinions.

Design domain-appropriate evaluation metrics for industrial diagnostic AI that go beyond generic accuracy and F1
Build regression test suites for the AI diagnostic agent: known-answer cases that catch model regressions before production deployment
Work with domain engineers to establish ground truth for ambiguous events and document expert disagreements as calibration data
Define the evaluation protocol for new agent capabilities before they reach production — the quality gate, not an afterthought

Intelligent Event Routing — ML Model (20%)

Anomaly events are classified and routed before they reach the AI layer. As the platform matures, this routing needs to learn from historical feedback. You build and own that model.

Own the ML model that routes equipment anomaly events, trained on labeled historical examples and improving as feedback accumulates
Engineer features from sensor signals, domain-derived features, and historical event outcomes
Own the full model lifecycle: training, evaluation, A/B testing against the deterministic baseline, and deployment via the squad's review process

AI Quality Analytics & Reporting (15%)

Make the flywheel visible — to your squad, to leadership, and to the engineers making deployment decisions.

Own the weekly AI quality sync with squad leadership and the Director of Data Science and AI: bring data on training pipeline health, label quality, and agent performance trends
Establish and maintain OKR baselines for the monitoring sub-stream: define the metrics, set the baseline, track improvement
Build the internal views that tell the squad whether the feedback-to-training loop is functioning
Translate model quality signals into plain-language guidance for engineering decisions

What Success Looks Like

First 30 days: You have a deep understanding of the existing feedback collection pipeline, the event classification system, and how domain engineers actually reason when diagnosing equipment events. Your first pass at a labeling schema and evaluation framework is proposed and under review.

First 60 days: Evaluation benchmarks are defined for the AI diagnostic agent. An RLHF label quality baseline is established. OKR metrics for the monitoring sub-stream are agreed on, instrumented, and tracked weekly.

First 180 days: The Intelligent Event Routing ML model is live and measurably outperforming the deterministic baseline on held-out test cases. The RLHF pipeline is producing high-quality, consistently structured training data. Every agent deployment decision is backed by an eval score you own.

Your success metric: By month twelve, the diagnostic agent's correctness rate is measurably and continuously improving sprint-over-sprint, the RLHF flywheel is self-sustaining, and the question "is the AI actually getting better?" has a precise, data-backed answer every week.

Who You Are

Must-Have

4+ years in ML or data science, with at least 1–2 years where your primary output was model quality, evaluation design, or training data infrastructure — not reports or dashboards
Evaluation design depth — you know that "87% on the test set" is not the end of the conversation. You build evaluation sets that catch real failure modes, measure calibration, and separate genuine improvement from distributional shift
Comfort with ambiguous ground truth — you have worked in domains where "correct" requires expert judgment and you know how to handle that systematically
Strong Python ML stack — scikit-learn, HuggingFace ecosystem, production-grade data pipelines. Not just notebooks — clean, versioned, reproducible code
Communication that translates — you can explain model quality to engineers who do not speak ML and to leadership who need deployment confidence. You present data weekly, not quarterly

Strong Signals

RLHF or human-in-the-loop ML experience — you have designed annotation schemas, managed labeling quality, or built preference learning pipelines. You understand the failure modes: label noise, annotator drift, reward hacking
LLM evaluation frameworks (RAGAS, LLM-as-judge, DeepEval, or custom evals you built from scratch)
Annotation tooling experience (Label Studio, Argilla, or similar) — especially for structured or multi-step outputs
Fine-tuning experience on domain-specific datasets — even once, even small scale
MLflow, Weights & Biases, or similar experiment tracking in production workflows
SQL proficiency for querying operational and event history data
Experience embedded in a software engineering squad (not a pure data science team) — you write production-quality code, use version control, and understand what "shippable" means
Sensor data or time-series background — if you understand how real-world signals behave, your ramp is significantly compressed

The Right Mindset

Measurement-obsessed — you get uncomfortable when a model decision is made on vibes. If you cannot measure it, you cannot improve it, and that bothers you.
Domain-curious — you do not need industry knowledge on day one, but you find it interesting that the features feeding our models are grounded in real physics and engineering principles, not learned from data alone. The domain is learnable — you will pick up industry context faster here than anywhere else.
Engineering-minded — you are embedded in a software engineering squad. Clean code, version control, PR reviews, and production standards are the default, not the exception.
Close to the people who know — the domain engineers are your most important data source. Getting their reasoning captured efficiently without burning their time is a core job competency here.

Why Mechademy

The data is irreplaceable. Years of expert engineer reasoning traces, equipment event histories across 15+ enterprise clients, and physics-grounded feature sets. This training corpus does not exist anywhere else. When a model is fine-tuned on it, it learns domain knowledge no foundation model has ever seen.

The evaluation problem is genuinely hard. How do you measure whether an AI diagnosing industrial equipment is correct? Ground truth requires domain expertise. Failure has real operational consequences. Most ML scientists never encounter evaluation challenges at this level of difficulty — here, it is the core of the role.

The flywheel is real. The more the AI is used in production, the more feedback accumulates, the better the model gets — and that loop is self-reinforcing. You are not building a one-off model. You are building a system that gets smarter with every deployment at every client site. That is a fundamentally different kind of ML work.

The upside is real. Closing Series A, 80–90% YoY growth, 15+ enterprise clients and growing. Competitive compensation: we want the people building this to share in what they create.

What We Offer

Production AI infrastructure from day one: RLHF pipeline, LangGraph, Langfuse, Ray, MLflow, pgvector
Direct access to domain engineers with decades of industrial equipment pattern recognition — expertise that does not exist in the market
Real evaluation challenges: AI systems where wrong answers have operational consequences and ground truth requires expert judgment
An AI Engineer (also being hired for this squad) who will work alongside you on the platform's AI layer — you bring the science, they bring the systems
Hybrid flexibility: 2–3 days on-site in Gurugram, rest remote
Competitive compensation

Qualifications

B.Tech / B.E. / B.S. in Computer Science, Statistics, Mathematics, Engineering, or a related discipline
Master's preferred, but track record matters more than credentials: show us your evaluation frameworks, not your GPA
Startup or high-growth environment experience strongly preferred
