Sanity Agent: One Agent as Evaluator of Another
Background: what problem this solves
Tool-using agents are hard to validate if you only check the final answer: you cannot tell sound tool use from a lucky guess, and raw log prose is hard to automate and aggregate. A harness like Sanity Agent exists to freeze each real run into a schema-valid trace, then have a second LLM act only as judge against gold, emitting a verdict you can roll up. This scores both outcome and trajectory, while avoiding a single mega-prompt that is both player and referee, and an orchestrator that reads traces itself and substitutes for a real Evaluator.
Four terms
| Term | Definition |
|---|---|
| Runner | Agent under test; prompt is task only (no gold), tools allowed, must write a trace file. |
| trace | Structured timeline of one run (thoughts, tools, results, final answer, …), shaped by JSON Schema—evidence for process checks. |
| Evaluator | Judge LLM; prompt is trace + gold + verdict path only; no tools; writes verdict. |
| verdict | Structured judgment for that run (correctness, expected behavior, …), schema-backed; reasons should point into the trace. |
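To make the trace and verdict shapes concrete, here is a minimal sketch. The field names (`case_id`, `steps`, `final_answer`, `correct`, `reasons`) are illustrative assumptions, not the project's actual schemas, and the gate is a stand-in for a full JSON Schema validator:

```python
# Illustrative shapes for a trace and a verdict, plus a minimal schema gate.
# Field names are assumptions; a real harness would validate against full
# JSON Schemas (e.g. with the jsonschema library).

TRACE_REQUIRED = {"case_id": str, "steps": list, "final_answer": str}
VERDICT_REQUIRED = {"case_id": str, "correct": bool, "reasons": list}

def gate(doc, required):
    """Programmatic gate: every required field present with the right type."""
    return isinstance(doc, dict) and all(
        k in doc and isinstance(doc[k], t) for k, t in required.items()
    )

trace = {
    "case_id": "case-001",
    "steps": [  # timeline: thoughts, tool calls, tool results
        {"kind": "thought", "content": "Need current weather; call the tool."},
        {"kind": "tool_call", "content": "get_weather(city='Oslo')"},
        {"kind": "tool_result", "content": "3 C, overcast"},
    ],
    "final_answer": "It is 3 C and overcast in Oslo.",
}

verdict = {
    "case_id": "case-001",
    "correct": True,
    "reasons": ["Final answer matches gold; steps[1] shows a real tool call."],
}

assert gate(trace, TRACE_REQUIRED) and gate(verdict, VERDICT_REQUIRED)
```

Note that the verdict's reasons point back into the trace (`steps[1]`), which is what makes a verdict auditable rather than a bare score.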
In one line: Runner → trace → programmatic schema gate → Evaluator → verdict → schema gate → aggregate; the orchestrator schedules and validates only—no ghost traces, no inline grading.
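The one-line pipeline above can be sketched as a per-case loop. This is a sketch, not the harness's real code: `run_agent`, `judge`, and `gate` are hypothetical stand-ins for the LLM calls and schema checks, and a real run would read and write trace/verdict files rather than pass dicts:

```python
import json

def run_case(case, run_agent, judge, gate):
    """One case: Runner -> trace -> gate -> Evaluator -> verdict -> gate."""
    # Runner sees the task only -- gold never enters its prompt.
    trace = run_agent(prompt=case["task"])
    if not gate(trace, "trace"):
        return {"case_id": case["id"], "status": "invalid_trace"}
    # Evaluator sees trace + gold, has no tools, and only judges.
    verdict = judge(prompt=json.dumps({"trace": trace, "gold": case["gold"]}))
    if not gate(verdict, "verdict"):
        return {"case_id": case["id"], "status": "invalid_verdict"}
    # The orchestrator never grades: it only schedules and validates.
    return {"case_id": case["id"], "status": "ok", "verdict": verdict}
```

The design point is the separation of prompts: gold appears only in the Evaluator's input, and the orchestrator's only verbs are schedule and validate.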
Workflow
Preflight (environment/auth check + case list) → per case: Runner → trace → schema gate → Evaluator → verdict → schema gate → Aggregate when all cases are done. Hard rule: no shortcut where the orchestrator reads the trace and replaces the Evaluator.
```mermaid
flowchart TB
  subgraph PF["Preflight"]
    CH[Checker]
    CL[Case list]
    CH --> CL
  end
  subgraph PC["Each case"]
    OR[Orchestrator]
    RU[Runner: task only]
    TR[("trace")]
    GT{Schema: trace}
    EV[Evaluator: trace + gold]
    VD[("verdict")]
    GV{Schema: verdict}
    OR --> RU --> TR --> GT --> EV --> VD --> GV
  end
  AG[Aggregate]
  PF --> OR
  GV --> MORE{More cases?}
  MORE -->|yes| OR
  MORE -->|no| AG
```
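The final Aggregate step can be sketched as a simple roll-up over verdicts. The two fields (`correct` for outcome, `expected_behavior` for trajectory) are assumed names matching the verdict table above, not a fixed format:

```python
# Sketch of the aggregate step: roll schema-valid verdicts up into
# outcome and trajectory pass rates. Field names are assumptions.
def aggregate(verdicts):
    total = len(verdicts)
    outcome = sum(1 for v in verdicts if v["correct"])
    trajectory = sum(1 for v in verdicts if v["expected_behavior"])
    return {
        "cases": total,
        "outcome_pass_rate": outcome / total if total else 0.0,
        "trajectory_pass_rate": trajectory / total if total else 0.0,
    }

report = aggregate([
    {"correct": True, "expected_behavior": True},
    {"correct": True, "expected_behavior": False},  # right answer, bad process
])
# -> {'cases': 2, 'outcome_pass_rate': 1.0, 'trajectory_pass_rate': 0.5}
```

Keeping outcome and trajectory as separate rates is the payoff of the whole design: a lucky guess shows up as a gap between the two numbers.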