Benchmark Provenance

Benchmark suites usually need more than a prompt and a score. They carry source pins, task patches, generated dataset rows, oracle data, setup scripts, and verification commands. AgentV represents that with existing primitives:

Put runtime behavior in workspace, execution, input, expected_output, and assertions.
Put provenance and classification in per-case metadata.
Put bulky per-case authoring inputs in optional case directories and supporting files.
Use generated run folders, not hand-authored source bundles, as the portable audit artifact.

These are documentation patterns, not special runtime schema keys. AgentV does not interpret keys such as source_commit, test_patch, or question_type unless your hook or custom assertion reads them.

Operational vs Informational Fields

Use this split when deciding where a benchmark key belongs:

Field area	Operational?	What AgentV does
`workspace.repos[]`	Yes	Declares repo identity and checkout refs; AgentV resolves acquisition and materializes the checkout.
`workspace.template`	Yes	Copies a workspace template into the run workspace.
`workspace.hooks`	Yes	Runs lifecycle commands with workspace and case context on stdin.
`workspace.isolation`, `workspace.mode`, `workspace.path`	Yes	Controls workspace reuse and materialization.
`execution`	Yes	Selects targets, thresholds, dependencies, and default grader behavior.
`input`, `input_files`, `expected_output`	Yes	Builds the target prompt and passive reference answer.
`assertions`	Yes	Runs deterministic, LLM, composite, or code graders.
Top-level `name`, `version`, `tags`, `license`, `requires`	Informational	Identifies and categorizes the suite.
`tests[].metadata`	Informational to AgentV	Passes arbitrary case data through to results and hook stdin; in-process custom assertions can also read it.

metadata can still become operational inside your own hook scripts. For example, a before_each hook can read case_metadata.test_patch and apply that patch before the agent starts. The distinction is that AgentV itself only passes the metadata along; the script owns the behavior.

Hook Payloads

Lifecycle hooks receive JSON on stdin. Case-scoped hooks such as per-test before_all, before_each, and after_each receive the current test’s metadata as case_metadata:

{
  "workspace_path": "/home/user/.agentv/workspaces/run-123/case-01",
  "test_id": "case-01",
  "eval_run_id": "run-123",
  "case_input": "Fix the bug",
  "case_metadata": {
    "source_commit": "4f3e2d1",
    "test_patch": "cases/case-01/test.patch"
  }
}

Suite-level before_all hooks run once for the workspace, before any one test is selected, so they should do suite setup only. Use before_each when setup depends on per-case metadata such as a patch path, source row, or selected test list.

Task Artifact Anatomy

Benchmark task packs map cleanly onto AgentV fields at authoring time:

Task artifact	AgentV pattern
Prompt or instruction	`input`, usually with `type: file` blocks for long prompts
Source checkout	`workspace.repos[].repo` and `workspace.repos[].commit`
Per-case setup	`workspace.hooks.before_each` reading `case_metadata`
Gold answer	`expected_output` when the answer is passive reference data
Active verification	`assertions`, especially `code-grader` for commands or artifact checks
Provenance	`tests[].metadata` with source pins, generator rows, and curation labels
Bulky task files	Optional `tests: ./cases/` with per-case directories and supporting files

Use this separation only when it makes the source eval easier to maintain. It is not a first-class artifact schema. After an eval runs, AgentV writes the portable audit surface into the generated run folder: each result can link from index.jsonl to a run-local task/ bundle containing EVAL.yaml, targets.yaml, and copied files/ or graders/ snapshots where applicable. Review, Dashboard files views, and rerun workflows should inspect those generated run artifacts instead of requiring authors to maintain a parallel source-side bundle layout. See Generated Task Bundles.

SWE-Style Case

A SWE-style benchmark usually needs a source repo, a commit pin, a patch that adds or selects tests, and a list of failing tests that should pass after the agent’s fix. Keep the checkout operational under workspace.repos; keep the benchmark provenance and per-case test selectors in metadata.

name: swe-style-regression
description: Regression tasks against pinned source commits.

workspace:
  isolation: per_test
  repos:
    - path: ./repo
      repo: https://github.com/example/widget.git
      commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
  hooks:
    before_each:
      command: ["python", "./scripts/apply-test-patch.py"]
      timeout_ms: 120000
    after_each:
      reset: strict

assertions:
  - name: focused-tests
    type: code-grader
    command: ["python", "./graders/run-focused-tests.py"]
    required: true

tests:
  - id: widget-1234
    criteria: Fix the widget parser regression without breaking existing behavior.
    input: |
      Work in repo/. Fix the parser regression described by the failing tests.
      Do not change unrelated public APIs.
    metadata:
      repo_url: https://github.com/example/widget.git
      source_commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
      test_patch: cases/widget-1234/test.patch
      fail_to_pass_tests:
        - tests/parser.test.ts::handles-empty-widget
        - tests/parser.test.ts::preserves-widget-id

In this example, workspace.repos[].commit is the actual checkout. The matching metadata.source_commit is audit data that gets recorded with the case and is available to scripts. apply-test-patch.py can read case_metadata.test_patch and case_metadata.fail_to_pass_tests, then apply the patch and write the selected test list into the workspace. The code grader can read that workspace file through its workspace_path payload. Repo acquisition remains outside the eval; use registered projects or git_cache.mirrors when a local machine needs faster large-repo setup. See Workspace Architecture.

Native AgentV vs Harbor-backed Benchmarks

Use native AgentV workspaces for repo-backed evals where AgentV should own the run lifecycle: materialize generic repos, run targets, execute hooks and graders, gate CI, and write AgentV result bundles. This fits custom internal suites, target comparisons, narrow regression suites, and CI checks built from AgentV primitives.

name: repo-regressions

workspace:
  isolation: per_test
  repos:
    - path: ./repo
      repo: https://github.com/example/widget.git
      commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
  hooks:
    before_each:
      command: ["python", "./scripts/apply-case-fixtures.py"]

execution:
  targets: [codex, claude]

assertions:
  - name: tests-pass
    type: code-grader
    command: ["python", "./graders/run-tests.py"]
    required: true

Use a Harbor-backed runner for standard benchmark suites Harbor owns, such as SWE-Bench Verified, Multi-SWE-Bench, Terminal-Bench, or suites with Harbor-owned Docker and Compose adapters. In that path AgentV should stay at the orchestration boundary: launch or import the Harbor job, apply AgentV gates to the imported results, and link Opik traces when Harbor uploads them.

# Proposed runner boundary, not a current AgentV task schema.
name: swebench-verified-codex

execution:
  runner: harbor
  harbor:
    dataset: swebench-verified
    agent: codex
    model: openai/gpt-5-mini
    opik:
      enabled: true

Do not translate Harbor task.toml, verifier packaging, or suite-specific Docker/Compose adapter fields into AgentV core eval schema. If the benchmark’s runtime contract is already owned by Harbor, keep those details in Harbor and let AgentV consume the job metadata, rewards, artifacts, and trace links.

Finance-Style Generated Dataset

Generated datasets often need stable row provenance more than workspace setup. Keep the generated row identity in metadata, use expected_output for the gold answer, and score with rubrics or an LLM/code grader.

name: finance-research-generated
description: Generated finance research cases with row-level provenance.

assertions:
  - name: answer-quality
    type: llm-grader
    prompt: ./graders/finance-answer.md
    required: true

tests:
  - id: finance-agent-row-0042
    criteria: Answer the finance question with the correct conclusion and evidence.
    input: |
      Research the company filing and answer:
      What drove the year-over-year change in gross margin?
    expected_output:
      - role: assistant
        content: |
          Gross margin improved because product mix shifted toward higher-margin
          software revenue while fulfillment costs declined.
    metadata:
      source_repo: https://github.com/example/finance-research-dataset.git
      source_commit: 05b8b2e9f071e8d0a6f1c2b3d4e5f60718293abc
      source_file: data/generated/finance_agent.csv
      source_row: 42
      question_type: margin_analysis

Here, source_repo, source_commit, source_file, source_row, and question_type are informational metadata. They support audits, slices, and regeneration checks. If a hook or grader needs the source file at runtime, clone it through workspace.repos or make the generator output available as a normal fixture file.

Optional Source-Side Case Directories

Inline YAML is fine when a case has a short prompt, a short expected answer, and a few metadata fields. Move source inputs into case directories only when the benchmark starts accumulating bulky authoring resources:

The case has patches, hidden tests, oracle JSON, screenshots, reports, or fixture files.
The prompt or expected output is long enough that YAML diffs become hard to review.
Each task needs a different workspace template or setup files.
A generator emits many rows and reviewers need to inspect individual cases.
Hook and grader scripts need stable file paths for per-case resources.

Use an external YAML or JSONL file for many simple generated rows:

name: generated-finance
tests: ./cases.jsonl

Use case directories when each case needs supporting files:

swe-benchmark/
  EVAL.yaml
  cases/
    widget-1234/
      case.yaml
      prompt.md
      test.patch
      oracle.json
      workspace/
        README.md

name: swe-benchmark
workspace:
  repos:
    - path: ./repo
      repo: https://github.com/example/widget.git
      commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
tests: ./cases/

criteria: Fix the widget parser regression.
input:
  - role: user
    content:
      - type: file
        value: cases/widget-1234/prompt.md
metadata:
  repo_url: https://github.com/example/widget.git
  source_commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
  test_patch: cases/widget-1234/test.patch
  oracle_file: cases/widget-1234/oracle.json

When tests points to a directory, AgentV discovers each immediate subdirectory’s case.yaml, uses the directory name as id if no id is set, and automatically uses a workspace/ subdirectory as that case’s workspace.template. File blocks still use the normal eval-file search roots, so include the case directory in paths such as cases/widget-1234/prompt.md. Metadata paths are not resolved by AgentV; resolve them in your hook or grader script.

Authoring Rules

Do not add benchmark-specific fields when metadata plus hooks or custom assertions can express the need.
Do not duplicate operational checkout state only in metadata. Put the real checkout under workspace.repos.
Keep metadata snake_case because it crosses process and result boundaries.
Prefer expected_output for passive gold answers and code-grader for active commands, file checks, or generated artifact validation.
Prefer case directories over long inline YAML only for bulky source inputs; the generated run folder remains the portable artifact contract.