Job details:
What We're Building
Enterprise AI products make capability claims that drive hundreds of millions of dollars in deal value and procurement decisions. Nobody independently verifies them.
The Caliper Lab is building the infrastructure to change that: structured, empirical assessment of whether AI product capability claims hold up against frontier model baselines, and how durable those claims are as the frontier improves.
The output is institutional-grade research that PE investors, procurement teams, and serious practitioners rely on to make decisions. This is early-stage, which means you would be building the research function, not slotting into one.
The Problem Is Genuinely Hard and Interesting
Evaluating AI capability is not advisory judgment. It requires task design, verified ground truth construction, controlled baseline comparisons, and scoring methodology that holds up to scrutiny from both investment professionals and academics. The academic foundations exist. The independent institutional home for applying them to commercial products does not yet exist. We are building it.
What You Will Work On
- Design and execute benchmark evaluations of AI products against frontier model baselines
- Build and maintain verified ground truth datasets across legal, financial, and knowledge worker AI categories
- Synthesise signal from practitioner panels, public sources, and live benchmarks into published research reports
- Contribute to methodology development: how the Lab measures what it measures, and how that evolves as the field develops
- Work at the cutting edge of AI evaluation alongside some of the best research talent in the field
This is not a literature review role. You will produce findings that go in front of PE investors and enterprise buyers.
Who We Are Looking For
- 2-4 years of experience producing structured research on technology markets or AI systems
- You could come from a research role at IDC, Gartner, Forrester, or similar; a research or analytics function inside a consulting firm; an applied ML or data science role where you produced findings, not just models; or an independent research track with published output
- Quantitative background — statistics, mathematics, economics, engineering, or similar
What we need to see:
- You read AI evaluation methodology papers and have opinions about what they get right and wrong
- You understand what LLMs actually do well and poorly on real tasks — from working with them, not reading about it
- You can write a clear finding from a messy dataset and defend it to a sceptical audience
- You have shipped work under time pressure for a real stakeholder
The setup:
- 3-month contract to start, with monthly compensation
- Can convert to a full-time founding team role with a compensation increase (and/or equity) for the right person
- Remote: you will work directly with the founder (a former Bain Principal with a strong private equity and tech advisory background) and senior methodology advisors (academics at the cutting edge of LLM evaluation research)
To apply, please send your resume and a short write-up (300-400 words) on a specific AI product you have used for real work. What did it actually do well, what did it fail at, and how would you design a test to verify whether that failure is systematic or incidental? Please send all applications to dhruv@thecaliperlab.com