Sanity Agent: One Agent as Evaluator of Another
Background: what problem this solves
Tool-using agents are hard to validate if you only check the final answer: you cannot tell sound tool use from a lucky guess, and raw log prose is hard to automate and aggregate. A harness like Sanity Agent exists to freeze each real run into a schema-valid trace, then have a second LLM act only as judge against gold, emitting a verdict you can roll up. This scores both outcome and trajectory, while avoiding a single mega-prompt that is both player and referee, and an orchestrator that reads traces itself and substitutes for a real Evaluator.
Four terms
| Term | Definition |
|---|---|
| Runner | Agent under test; prompt is task only (no gold), tools allowed, must write a trace file. |
| trace | Structured timeline of one run (thoughts, tools, results, final answer, …), shaped by JSON Schema—evidence for process checks. |
| Evaluator | Judge LLM; prompt is trace + gold + verdict path only; no tools; writes verdict. |
| verdict | Structured judgment for that run (correctness, expected behavior, …), schema-backed; reasons should point into the trace. |
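To make the trace and verdict shapes concrete, here is a minimal sketch. The field names (`case_id`, `steps`, `final_answer`, `correct`, `reasons`) are illustrative assumptions, not the project's actual schemas, and the gate is a stand-in for a full JSON Schema validator:

```python
# Illustrative shapes for a trace and a verdict, plus a minimal schema gate.
# Field names are assumptions; a real harness would validate against full
# JSON Schemas (e.g. with the jsonschema library).

TRACE_REQUIRED = {"case_id": str, "steps": list, "final_answer": str}
VERDICT_REQUIRED = {"case_id": str, "correct": bool, "reasons": list}

def gate(doc, required):
    """Programmatic gate: every required field present with the right type."""
    return isinstance(doc, dict) and all(
        k in doc and isinstance(doc[k], t) for k, t in required.items()
    )

trace = {
    "case_id": "case-001",
    "steps": [  # timeline: thoughts, tool calls, tool results
        {"kind": "thought", "content": "Need current weather; call the tool."},
        {"kind": "tool_call", "content": "get_weather(city='Oslo')"},
        {"kind": "tool_result", "content": "3 C, overcast"},
    ],
    "final_answer": "It is 3 C and overcast in Oslo.",
}

verdict = {
    "case_id": "case-001",
    "correct": True,
    "reasons": ["Final answer matches gold; steps[1] shows a real tool call."],
}

assert gate(trace, TRACE_REQUIRED) and gate(verdict, VERDICT_REQUIRED)
```

Note that the verdict's reasons point back into the trace (`steps[1]`), which is what makes a verdict auditable rather than a bare score.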
In one line: Runner → trace → programmatic schema gate → Evaluator → verdict → schema gate → aggregate; the orchestrator schedules and validates only—no ghost traces, no inline grading.
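The one-line pipeline above can be sketched as a per-case loop. This is a sketch, not the harness's real code: `run_agent`, `judge`, and `gate` are hypothetical stand-ins for the LLM calls and schema checks, and a real run would read and write trace/verdict files rather than pass dicts:

```python
import json

def run_case(case, run_agent, judge, gate):
    """One case: Runner -> trace -> gate -> Evaluator -> verdict -> gate."""
    # Runner sees the task only -- gold never enters its prompt.
    trace = run_agent(prompt=case["task"])
    if not gate(trace, "trace"):
        return {"case_id": case["id"], "status": "invalid_trace"}
    # Evaluator sees trace + gold, has no tools, and only judges.
    verdict = judge(prompt=json.dumps({"trace": trace, "gold": case["gold"]}))
    if not gate(verdict, "verdict"):
        return {"case_id": case["id"], "status": "invalid_verdict"}
    # The orchestrator never grades: it only schedules and validates.
    return {"case_id": case["id"], "status": "ok", "verdict": verdict}
```

The design point is the separation of prompts: gold appears only in the Evaluator's input, and the orchestrator's only verbs are schedule and validate.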
Workflow
Preflight (environment/auth check + case list) → per case: Runner → trace → schema gate → Evaluator → verdict → schema gate → Aggregate when all cases are done. Hard rule: no shortcut where the orchestrator reads the trace and replaces the Evaluator.
```mermaid
flowchart TB
  subgraph PF["Preflight"]
    CH[Checker]
    CL[Case list]
    CH --> CL
  end
  subgraph PC["Each case"]
    OR[Orchestrator]
    RU[Runner: task only]
    TR[("trace")]
    GT{Schema: trace}
    EV[Evaluator: trace + gold]
    VD[("verdict")]
    GV{Schema: verdict}
    OR --> RU --> TR --> GT --> EV --> VD --> GV
  end
  AG[Aggregate]
  PF --> OR
  GV --> MORE{More cases?}
  MORE -->|yes| OR
  MORE -->|no| AG
```
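The final Aggregate step can be sketched as a simple roll-up over verdicts. The two fields (`correct` for outcome, `expected_behavior` for trajectory) are assumed names matching the verdict table above, not a fixed format:

```python
# Sketch of the aggregate step: roll schema-valid verdicts up into
# outcome and trajectory pass rates. Field names are assumptions.
def aggregate(verdicts):
    total = len(verdicts)
    outcome = sum(1 for v in verdicts if v["correct"])
    trajectory = sum(1 for v in verdicts if v["expected_behavior"])
    return {
        "cases": total,
        "outcome_pass_rate": outcome / total if total else 0.0,
        "trajectory_pass_rate": trajectory / total if total else 0.0,
    }

report = aggregate([
    {"correct": True, "expected_behavior": True},
    {"correct": True, "expected_behavior": False},  # right answer, bad process
])
# -> {'cases': 2, 'outcome_pass_rate': 1.0, 'trajectory_pass_rate': 0.5}
```

Keeping outcome and trajectory as separate rates is the payoff of the whole design: a lucky guess shows up as a gap between the two numbers.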