Website:
emergent.sh
Job details:
Emergent builds autonomous coding agents that replace traditional software development by generating, testing, and deploying production applications directly from plain-language intent. Our systems run in production at global scale and are used to build millions of real applications.
Emergent reached $100M ARR within 8 months of public launch. 6M+ users across 190+ countries have built 6.5M+ applications on Emergent. We've raised $100M+, backed by Khosla Ventures, SoftBank, Google, Lightspeed, Prosus, Together, and Y Combinator.
We're solving the hard part of AI-driven software creation: correctness, reliability, security, and scale in real production systems. The team is made up of repeat founders, Olympiad medalists, IIT & IIM alumni, and leaders from Google, Amazon, and Dropbox.
We're hiring builders who want ownership, speed, and impact at global scale.
The Role:
We're looking for a Research Engineer to characterize, measure, and advance the capabilities of our coding agents. You will turn ambiguous notions of "agent quality" into clear, defensible metrics that the team, leadership, and the field can rely on, and you will use those metrics to drive both incremental wins and moonshots in agent performance.
This is a deep-work role at the intersection of agent behavior, evaluation research, and applied training. You will define what good looks like for long-horizon coding agents, build the evaluation dataset and methodology that produce those signals, mine production data for failure modes most teams never see, and run targeted training, fine-tuning, RL, memory, and prompt-optimization experiments that translate research advances into shipped improvements. You will operate with strong independence, make hard calls in inherently subjective and probabilistic systems, and own outcomes end-to-end.
If you treat models as objects of study rather than black boxes, take pride in moving benchmark numbers with rigor, and want to apply the frontier of agent research at the scale of millions of real applications, this is your role.
What You'll Do:
- Architect the next version of the Emergent agent. Shape the core architecture and make the foundational design choices that define how the agent thinks, learns, and improves over time.
- Characterize agent behavior at depth. Develop a deep, evidence-grounded understanding of how the agent succeeds and fails across the full range of real-world usage, and convert that understanding into rigorous, quantitative measurement.
- Design and ship evaluations across reasoning, planning, tool use, code correctness, long-horizon execution, security, and agent reliability. Define the metric, build the dataset, validate against known signals, and ship dashboards that make regressions impossible to miss.
- Drive step-function gains. Take on the ambitious bets that meaningfully advance the state of the art: 10-point leaps on hard capabilities, not incremental polish. Pick the problems where the upside is large and the path is uncertain.
- Climb public benchmarks. Move the needle on SWE-bench Pro, Terminal-Bench, and other industry-standard benchmarks the field uses to grade coding agents.
- Run training and post-training experiments (supervised fine-tuning, RLHF/RLAIF, DPO, distillation, reward modeling, prompt optimization, and judge-model calibration) against production-grounded objectives.
- Own end-to-end. Carry work from hypothesis through experiment design, execution, analysis, decision, rollout, and post-launch measurement. Read research papers deeply, draw ideas from them, and turn those ideas into shipped outcomes.
- Make hard calls in subjective systems. Decide when a regression is real, when a win is noise, when a benchmark is overfit, when to ship despite mixed signals, and when to kill a promising direction. Communicate the reasoning crisply.
Who You Are:
- 5–8 years of AI experience, with meaningful time spent either training and fine-tuning models, or designing rigorous evaluations and measurement systems for them. Both paths are equally valued for this role.
- Hands-on with the modern AI stack and fluent in Python (Go a plus) for research workflows: training pipelines, eval harnesses, data processing, statistical analysis. Comfortable with transformers, RLHF/DPO/RL for agents, eval frameworks (Inspect, lm-eval-harness, or equivalent), prompt optimization, judge models, and agent frameworks. You pick up new tooling in days.
- Take pride in numbers that move. You measure first, opine second. You can defend why a benchmark is the right benchmark, why a metric isn't gameable, and why a result is statistically real.
- Comfortable in subjective, probabilistic systems. You reason about noise floors, confounds, distribution shift, judge bias, and selection effects without flinching. You know when to trust a number and when to suspect it.
- Enjoy going deep into the long tail. Sifting through large volumes of agent behavior to find the rare, hidden failure mode energizes you rather than drains you.
- Understand models like friends. You have intuitions about how a model will behave on a new task before running it, and you update those intuitions when reality disagrees. You know what came out last week, why it matters, and which paper from two years ago is suddenly relevant again.
- Independent operator with leadership presence. You scope your own work, push back on weak ideas (including your manager's), and bring others along through clarity and conviction rather than consensus-seeking.
- Ship fast without compromising rigor. You know which corners are safe to cut and which are load-bearing. Bias toward velocity, but never at the cost of honest measurement.
Benefits and Perks:
- Daily Meals: Lunch and Dinner provided
- Family Insurance: 3 Lakhs worth of coverage for you and your family
- Unlimited Paid Time Off: Take the time you need to recharge and come back refreshed
- Flexible Working Hours: Work arrangements that fit your life and commitments
Let's build the future of software together.
Click Apply to learn more.