Mechademy
Website: mechademy.com
Job details:
The Opportunity
Most ML roles ask you to build models. This one asks you to build the system that decides whether the models are actually working — in production, on live industrial equipment, where a wrong answer has real operational consequences.
Mechademy's AI agents diagnose equipment faults, recommend actions, and route critical events. The question nobody has rigorously answered yet is: are they getting better? You will own that answer. You will design the evaluation frameworks, build the RLHF data quality layer, and create the metrics that turn "we think the AI is improving" into "we can prove it, sprint over sprint." This is the measurement backbone the entire AI system depends on — and it does not exist yet. You are building it.
About Mechademy
Mechademy builds an enterprise AI platform for real-time monitoring, diagnostics, and predictive maintenance of industrial equipment. Under the hood: physics-informed ML, custom AutoML, distributed training with Ray, and production AI running on every deployment.
- 45+ employees across New Delhi and Houston
- Venture-funded, closing Series A | 80–90% YoY growth
- 15+ enterprise clients across oil & gas, power generation, and LNG
- Physics-informed ML + production LLM/VLM: a 12–18 month lead over the field
What You'll Own
Training Data Quality & Feedback Pipelines (35%)
You own the data quality layer that determines whether the AI learns the right things from expert feedback — designing how it's captured, cleaning it, structuring it, and measuring whether it's actually improving the system.
- Design and maintain the labeling schema for domain engineer feedback: what gets annotated, how, and at what quality threshold
- Build pipelines that capture, clean, and structure monitoring session reasoning traces as training data
- Define inter-annotator agreement standards and audit label quality on a regular cadence
- Partner with the engineering team building the feedback collection mechanism to ensure captured data is structured for training
- Curate and maintain held-out evaluation sets that measure agent progress across sprints
Evaluation Framework Design (30%)
You define what "the agent is getting better" means — in numbers, not opinions.
- Design domain-appropriate evaluation metrics for industrial diagnostic AI that go beyond generic accuracy and F1
- Build regression test suites for the AI diagnostic agent: known-answer cases that catch model regressions before production deployment
- Work with domain engineers to establish ground truth for ambiguous events and document expert disagreements as calibration data
- Define the evaluation protocol for new agent capabilities before they reach production — the quality gate, not an afterthought
Intelligent Event Routing — ML Model (20%)
Anomaly events are classified and routed before they reach the AI layer. As the platform matures, this routing needs to learn from historical feedback. You build and own that model.
- Own the ML model that routes equipment anomaly events, trained on labeled historical examples, improving as feedback accumulates
- Engineer features from sensor signals, domain-derived features, and historical event outcomes
- Own the full model lifecycle: training, evaluation, A/B testing against the deterministic baseline, deployment via the squad's review process
AI Quality Analytics & Reporting (15%)
Make the flywheel visible — to your squad, to leadership, and to the engineers making deployment decisions.
- Own the weekly AI quality sync with squad leadership and the Director of Data Science and AI: bring data on training pipeline health, label quality, and agent performance trends
- Establish and maintain OKR baselines for the monitoring sub-stream: define the metrics, set the baseline, track improvement
- Build the internal views that tell the squad whether the feedback-to-training loop is functioning
- Translate model quality signals into plain-language guidance for engineering decisions
What Success Looks Like
First 30 days: You have a deep understanding of the existing feedback collection pipeline, the event classification system, and how domain engineers actually reason when diagnosing equipment events. Your first pass at a labeling schema and evaluation framework is proposed and under review.
First 60 days: Evaluation benchmarks are defined for the AI diagnostic agent. An RLHF label quality baseline is established. OKR metrics for the monitoring sub-stream are agreed on, instrumented, and tracked weekly.
First 180 days: The Intelligent Event Routing ML model is live and measurably outperforming the deterministic baseline on held-out test cases. The RLHF pipeline is producing high-quality, consistently structured training data. Every agent deployment decision is backed by an eval score you own.
Your success metric: By month twelve, the diagnostic agent's correctness rate is measurably and continuously improving sprint-over-sprint, the RLHF flywheel is self-sustaining, and the question "is the AI actually getting better?" has a precise, data-backed answer every week.
Who You Are
Must-Have
- 4+ years in ML or data science, with at least 1–2 years where your primary output was model quality, evaluation design, or training data infrastructure — not reports or dashboards
- Evaluation design depth — you know that "87% on the test set" is not the end of the conversation. You build evaluation sets that catch real failure modes, measure calibration, and separate genuine improvement from distributional shift
- Comfort with ambiguous ground truth — you have worked in domains where "correct" requires expert judgment and you know how to handle that systematically
- Strong Python ML stack — scikit-learn, HuggingFace ecosystem, production-grade data pipelines. Not just notebooks — clean, versioned, reproducible code
- Communication that translates — you can explain model quality to engineers who do not speak ML and to leadership who need deployment confidence. You present data weekly, not quarterly
Strong Signals
- RLHF or human-in-the-loop ML experience — you have designed annotation schemas, managed labeling quality, or built preference learning pipelines. You understand the failure modes: label noise, annotator drift, reward hacking
- LLM evaluation frameworks (RAGAS, LLM-as-judge, DeepEval, or custom evals you built from scratch)
- Annotation tooling experience (Label Studio, Argilla, or similar) — especially for structured or multi-step outputs
- Fine-tuning experience on domain-specific datasets — even once, even small scale
- MLflow, Weights & Biases, or similar experiment tracking in production workflows
- SQL proficiency for querying operational and event history data
- Experience embedded in a software engineering squad (not a pure data science team) — you write production-quality code, use version control, and understand what "shippable" means
- Sensor data or time-series background — if you understand how real-world signals behave, your ramp is significantly compressed
The Right Mindset
- Measurement-obsessed — you get uncomfortable when a model decision is made on vibes. If you cannot measure it, you cannot improve it, and that bothers you.
- Domain-curious — you do not need industry knowledge on day one, but you find it interesting that the features feeding our models are grounded in real physics and engineering principles, not learned from data alone. The domain is learnable — you will pick up industry context faster here than anywhere else.
- Engineering-minded — you are embedded in a software engineering squad. Clean code, version control, PR reviews, and production standards are the default, not the exception.
- Close to the people who know — the domain engineers are your most important data source. Getting their reasoning captured efficiently without burning their time is a core job competency here.
Why Mechademy
The data is irreplaceable. Years of expert engineer reasoning traces, equipment event histories across 15+ enterprise clients, and physics-grounded feature sets. This training corpus does not exist anywhere else. When a model is fine-tuned on it, it learns domain knowledge no foundation model has ever seen.
The evaluation problem is genuinely hard. How do you measure whether an AI diagnosing industrial equipment is correct? Ground truth requires domain expertise. Failure has real operational consequences. Most ML scientists never encounter evaluation challenges at this level of difficulty — here, it is the core of the role.
The flywheel is real. The more the AI is used in production, the more feedback accumulates, the better the model gets — and that loop is self-reinforcing. You are not building a one-off model. You are building a system that gets smarter with every deployment at every client site. That is a fundamentally different kind of ML work.
The upside is real. Closing Series A, 80–90% YoY growth, 15+ enterprise clients and growing. Competitive compensation: we want the people building this to share in what they create.
What We Offer
- Production AI infrastructure from day one: RLHF pipeline, LangGraph, Langfuse, Ray, MLflow, pgvector
- Direct access to domain engineers with decades of industrial equipment pattern recognition — expertise that does not exist in the market
- Real evaluation challenges: AI systems where wrong answers have operational consequences and ground truth requires expert judgment
- An AI Engineer (also being hired for this squad) who will work alongside you on the platform's AI layer — you bring the science, they bring the systems
- Hybrid flexibility: 2–3 days on-site in Gurugram, rest remote
- Competitive compensation
Qualifications
- B.Tech / B.E. / B.S. in Computer Science, Statistics, Mathematics, Engineering, or a related discipline
- Master's preferred, but track record matters more than credentials: show us your evaluation frameworks, not your GPA
- Startup or high-growth environment experience strongly preferred