Job details:
Mercor is seeking Software Engineering Experts to help design evaluation-ready workflows for advanced AI systems. In this role, you will translate ambiguous requirements into structured, reproducible artifacts that can be automatically tested and validated.
Your work will focus on producing clear documentation and deterministic scripts that enable reliable evaluation of AI agent performance across different scenarios. This is a contract-based, outcome-oriented role with a strong emphasis on reproducibility, automation, and precise acceptance criteria.
Key Responsibilities
- Translate high-level objectives into well-scoped, testable deliverables with defined inputs, outputs, and measurable success criteria.
- Create structured documentation describing expected system behavior, constraints, and edge cases for reusable evaluation workflows.
- Develop lightweight automation scripts to generate artifacts, validate outputs, and enforce formatting or structural requirements.
- Write deterministic Python verification scripts to validate completion using file checks, directory structures, and content assertions.
- Design prompts and tasks that reliably trigger intended workflows without exposing internal instructions or implementation details.
- Implement robust error handling and clear failure messages in evaluation and verification tooling.
- Create baseline or distractor approaches to ensure evaluations can distinguish between correct and ineffective solutions.
- Maintain clean and reproducible project structures, including consistent naming conventions and version-controlled artifacts.
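As an illustration of the deterministic verification scripts described above, here is a minimal sketch using file checks, directory checks, and content assertions. The paths, the required substring, and the `verify` function name are hypothetical and chosen only for this example. They do not describe Mercor's actual tooling.

```python
from pathlib import Path

# Hypothetical expectations for a single task (assumptions, not from the posting).
REQUIRED_DIRS = ["output"]
REQUIRED_FILES = ["README.md", "output/report.json"]
REQUIRED_SUBSTRINGS = {"README.md": "## Usage"}


def verify(root: Path) -> list[str]:
    """Return a sorted list of failure messages; an empty list means success."""
    failures = []
    # Directory structure checks.
    for rel in REQUIRED_DIRS:
        if not (root / rel).is_dir():
            failures.append(f"missing directory: {rel}")
    # File existence checks.
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            failures.append(f"missing file: {rel}")
    # Content assertions, applied only to files that exist.
    for rel, needle in REQUIRED_SUBSTRINGS.items():
        path = root / rel
        if path.is_file() and needle not in path.read_text(encoding="utf-8"):
            failures.append(f"{rel} does not contain {needle!r}")
    # Sorting keeps the report identical across runs and platforms.
    return sorted(failures)
```

Returning messages instead of raising on the first problem lets a single run report every failure, and the sorted output keeps repeated runs comparable.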
Required Qualifications
- Strong Python programming skills (file system operations, parsing, validation, deterministic execution).
- Experience with automated testing, evaluation frameworks, or QA-style verification workflows.
- Familiarity with LLM prompt design and evaluation methodologies such as closed-ended tasks and reliability testing.
- Experience writing structured documentation and specifications (Markdown, YAML, clearly scoped requirements).
- Comfort with developer tools and workflows, including Git, CLI environments, virtual environments, and dependency management.
- Strong written communication skills with the ability to translate ambiguous requirements into precise instructions.
Preferred / Bonus Skills
- Knowledge of embedding-based similarity methods (e.g., cosine similarity).
- Experience designing negative controls or distractor workflows for evaluation robustness.
- Background working with AI evaluation pipelines, automated grading systems, or benchmark frameworks.
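For reference, the cosine similarity mentioned in the bonus skills above can be sketched in plain Python as follows. The function name and the plain-list vector representation are assumptions made for illustration. A real pipeline would more likely use a numerical library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The measure is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors.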
What You'll Work On
- Documentation and scripts designed for automated AI evaluation pipelines.
- Deterministic validation frameworks ensuring consistent replay and verification.
- Evaluation tasks that prevent superficial shortcuts and enforce intended workflows.
- Structured deliverables enabling reliable testing across multiple AI agents and scenarios.
Contract & Payment Terms
- Engagement as an independent contractor.
- Fully remote with flexible scheduling.
- Project scope may be extended, shortened, or concluded early depending on performance and project needs.
- Weekly payments via Stripe or Wise based on completed work.
- Work will not involve access to confidential or proprietary information from any employer or institution.
- Unfortunately, we cannot support H-1B or STEM OPT candidates at this time.
About Mercor
Mercor partners with leading AI labs and enterprises to train and improve frontier AI systems using human expertise. Our contributors collaborate with researchers and engineers to help build, evaluate, and refine next-generation AI technologies.
https://t.mercor.com/JdI36