Software Engineering Experts (SWE) – AI Evaluation Workflows

Smart Cyber Recruiter

Location: India
Job type: Full-time

Required skills

Python
Acceptance criteria
Automated testing
CLI
Communication skills
Git
YAML

About the role

Mercor is seeking Software Engineering Experts to help design evaluation-ready workflows for advanced AI systems. In this role, you will translate ambiguous requirements into structured, reproducible artifacts that can be automatically tested and validated.

Your work will focus on producing clear documentation and deterministic scripts that enable reliable evaluation of AI agent performance across different scenarios. This is a contract-based, outcome-oriented role with a strong emphasis on reproducibility, automation, and precise acceptance criteria.

Key Responsibilities

Translate high-level objectives into well-scoped, testable deliverables with defined inputs, outputs, and measurable success criteria.
Create structured documentation describing expected system behavior, constraints, and edge cases for reusable evaluation workflows.
Develop lightweight automation scripts to generate artifacts, validate outputs, and enforce formatting or structural requirements.
Write deterministic Python verification scripts to validate completion using file checks, directory structures, and content assertions.
Design prompts and tasks that reliably trigger intended workflows without exposing internal instructions or implementation details.
Implement robust error handling and clear failure messages in evaluation and verification tooling.
Create baseline or distractor approaches to ensure evaluations can distinguish between correct and ineffective solutions.
Maintain clean and reproducible project structures, including consistent naming conventions and version-controlled artifacts.
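As an illustration of the deterministic verification scripts described above, a minimal sketch might look like the following. The directory layout (`results/`, `README.md`, `results/summary.yaml`), the required heading, and the function name `verify_submission` are hypothetical and chosen only for this example; real projects would define their own acceptance criteria.

```python
from pathlib import Path


def verify_submission(root):
    """Return a sorted list of failure messages; an empty list means pass."""
    root = Path(root)
    failures = []

    # Directory structure check (hypothetical layout).
    if not (root / "results").is_dir():
        failures.append("missing directory: results/")

    # File check, followed by a content assertion on the same file.
    readme = root / "README.md"
    if not readme.is_file():
        failures.append("missing file: README.md")
    else:
        text = readme.read_text(encoding="utf-8")
        if "## Acceptance Criteria" not in text:
            failures.append("README.md lacks '## Acceptance Criteria' heading")

    # Second file check inside the results directory.
    if not (root / "results" / "summary.yaml").is_file():
        failures.append("missing file: results/summary.yaml")

    # Sorting keeps the report identical across runs and platforms.
    return sorted(failures)
```

Collecting every failure before returning, rather than stopping at the first one, gives a clear and complete failure report, and the sorted output means repeated runs on the same input produce the same result.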

Required Qualifications

Strong Python programming skills (file system operations, parsing, validation, deterministic execution).
Experience with automated testing, evaluation frameworks, or QA-style verification workflows.
Familiarity with LLM prompt design and evaluation methodologies such as closed-ended tasks and reliability testing.
Experience writing structured documentation and specifications (Markdown, YAML, clearly scoped requirements).
Comfort with developer tools and workflows, including Git, CLI environments, virtual environments, and dependency management.
Strong written communication skills with the ability to translate ambiguous requirements into precise instructions.

Preferred / Bonus Skills

Knowledge of embedding-based similarity methods (e.g., cosine similarity).
Experience designing negative controls or distractor workflows for evaluation robustness.
Background working with AI evaluation pipelines, automated grading systems, or benchmark frameworks.
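For candidates less familiar with the cosine similarity mentioned above, the following is a minimal sketch using only the Python standard library. The function name `cosine_similarity` and the use of plain lists as stand-ins for embedding vectors are assumptions made for illustration; production pipelines typically rely on a numerical library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # The measure is undefined when either vector has zero magnitude.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return dot / (norm_a * norm_b)
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate orthogonal vectors, and values near -1 indicate opposite directions.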

What You'll Work On

Documentation and scripts designed for automated AI evaluation pipelines.
Deterministic validation frameworks ensuring consistent replay and verification.
Evaluation tasks that prevent superficial shortcuts and enforce intended workflows.
Structured deliverables enabling reliable testing across multiple AI agents and scenarios.

Contract & Payment Terms

Engagement as an independent contractor.
Fully remote with flexible scheduling.
Project scope may be extended, shortened, or concluded early depending on performance and project needs.
Weekly payments via Stripe or Wise based on completed work.
Work will not involve access to confidential or proprietary information from any employer or institution.
Unfortunately, we cannot support H-1B or STEM OPT candidates at this time.

About Mercor

Mercor partners with leading AI labs and enterprises to train and improve frontier AI systems using human expertise. Our contributors collaborate with researchers and engineers to help build, evaluate, and refine next-generation AI technologies.

https://t.mercor.com/JdI36
