# Graph Token-Usage Benchmark — v5 (Feature-Validation Round)
Mirrored from `benchmarks/graph/v5/RESULTS.md`. Edit the source document in the repository, not this generated page.
Status: Closed — final round of the v1-v5 series.
Scope: Focused validation of source_regex (the [T20260425-2140] tool change), not a full sweep. Codex only, graph-only arm only, 3 fixtures × 3 seeds = 9 cells.
Sweep ID / seed: 20260425-115117-21a938 / 142643867
Harness SHA: 56a9c07b... plus the T20260425-2140 source_regex implementation.
Why the small scope: v4 motivated a specific tool-surface change (source_regex). v5 measures whether that change delivered. A full 192-cell sweep would have re-measured shapes already characterized in v4 without changing the conclusion. See reading guide in the series wrap-up.
## Headline

Feature works; agents over-use it. All 9 cells passed, and where the regex is a natural fit, the call-count and token wins are large. But on two of three fixtures, agents iterated `source_regex` 7-19 times per seed (typically with the same regex but progressively narrower prefixes) instead of the predicted 1-2 calls. The tool surface is correct; the affordance is what missed.
## Pre-Fix vs Post-Fix Comparison

Baseline = the post-T20260425-0739 Codex graph-only data from v4 (preserved at `v4/_archive/codex-graph-only-pre-T20260425-2140/`). Post-fix = this round.
| fixture | arm | pass | median tokens | median calls/seed | source_regex calls/seed (post) |
|---|---|---|---|---|---|
| const-value-extraction | baseline | 3/3 | 67,515 | 23 | n/a |
| const-value-extraction | post-fix | 3/3 | 27,175 (-60%) | 5 (-78%) | 2, 3, 5 |
| reverse-export-orbit-error | baseline | 3/3 | 122,948 | 28 | n/a |
| reverse-export-orbit-error | post-fix | 3/3 | 45,907 (-63%) | 22 (-21%) | 15, 19, 7 |
| module-surface-orbit-mcp | baseline | 3/3 | 21,885 | 17 | n/a |
| module-surface-orbit-mcp | post-fix | 3/3 | 24,533 (+12%) | 16 (-6%) | 4, 7, 8 |
## Per-Fixture Diagnosis

const-value-extraction — clean win. The “find every pub const in this module” question maps directly onto `source_regex`: `^\s*pub\s+const\s+` plus a broad prefix. Agents wrote essentially that, ran it once or twice, and stopped. The acceptance criterion (≤6 calls/seed) was met cleanly.
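As a hedged illustration of why this regex shape fits the fixture (the module source and constant names below are invented, not from the fixture), the pattern behaves like this under Python's `re`:

```python
import re

# Invented Rust module source; the fixture's real files differ.
RUST_SRC = """\
pub const MAX_ORBITS: usize = 64;
    pub const EPOCH: u64 = 1_700_000_000;
const PRIVATE_SEED: u32 = 7;
pub fn helper() {}
"""

# The shape agents converged on: anchored at line start, tolerant of
# leading indentation, requiring the `pub const` keywords.
PATTERN = re.compile(r"^\s*pub\s+const\s+")

matches = [line for line in RUST_SRC.splitlines() if PATTERN.match(line)]
print(matches)
```

One anchored pattern enumerates the whole answer set in a single pass, which is why a single broad-prefix call suffices here.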
reverse-export-orbit-error — half-worked. Tokens fell hard (-63%) because each `source_regex` call returns a small answer set instead of the 6,300-char pack payloads it replaced. But agents called the same regex (`pub\s+use\s+.*OrbitError`) 7-19 times per seed, varying the prefix per crate. The tool can answer the question in one call; the agents iterated. The AC ceiling (≤5 calls/seed) was missed. The failure mode is agent affordance, not feature design.
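A minimal sketch of the over-iteration pattern, using an invented in-memory stand-in for the tool (the `source_regex` function, paths, and file contents below are all hypothetical):

```python
import re

# Toy stand-in for the graph's indexed sources; paths and contents invented.
SOURCES = {
    "crates/orbit-core/src/lib.rs": "pub use error::OrbitError;",
    "crates/orbit-mcp/src/lib.rs": "pub use orbit_core::OrbitError;",
    "crates/orbit-cli/src/main.rs": "fn main() {}",
}

def source_regex(prefix: str, pattern: str):
    """Hypothetical sketch of the tool: match pattern in files under prefix."""
    rx = re.compile(pattern)
    return sorted(path for path, text in SOURCES.items()
                  if path.startswith(prefix) and rx.search(text))

# One broad call answers the whole question...
broad = source_regex("crates/", r"pub\s+use\s+.*OrbitError")

# ...and is equivalent to iterating narrower prefixes per crate,
# which is the over-iteration pattern observed in the transcripts.
per_crate = sorted(hit
                   for prefix in ("crates/orbit-core/", "crates/orbit-mcp/",
                                  "crates/orbit-cli/")
                   for hit in source_regex(prefix, r"pub\s+use\s+.*OrbitError"))

print(broad == per_crate, len(broad))
```

The two result sets are identical; the per-crate loop only multiplies call count, which is exactly the cost the AC ceiling was meant to bound.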
module-surface-orbit-mcp — mismatched shape. The fixture asks “what’s exported from orbit-mcp” — a structural question the regex form doesn’t compress well. Agents tried regexes like `pub use|pub trait|pub fn` and then had to filter false positives manually. Net token cost rose ~12%. This is the residual case the original kind-taxonomy proposal would have covered; regex isn’t the right tool here.
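A hedged sketch of the false-positive problem (the sample lines are invented): a bare keyword alternation matches text, not structure, so comments and string literals hit too.

```python
import re

PATTERN = re.compile(r"pub use|pub trait|pub fn")

# Invented lines illustrating why the alternation needs manual filtering:
LINES = [
    "pub fn connect() {}",                   # genuine surface item
    "// See the pub fn below for details",   # comment: false positive
    '    let s = "pub trait marker";',       # string literal: false positive
    "fn private_helper() {}",                # correctly ignored
]

hits = [line for line in LINES if PATTERN.search(line)]
print(len(hits))
```

A structural (kind-aware) query would distinguish declarations from mentions; a line regex cannot, which is why token cost rose on this fixture.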
## What Changed in Response

The agent over-iteration on reverse-export is the load-bearing finding. Two-layer remediation:

- orbit-graph skill update (commit 1d306f03) — added a “Source-Regex Enumeration” section with the explicit “ONE broad prefix + ONE regex, do NOT iterate” rule, three worked example shapes, and a verification-loop anti-pattern callout in the Stop Rule. Strategic guidance now lives in the skill, where it belongs, not stuffed into the MCP tool description.
- MCP tool descriptions remained terse — the parameter contracts only. Strategy lives in the skill; tool descriptions describe the tool. (See series wrap-up §4 for the broader principle.)
The skill ships to both Claude (via `.claude/skills/orbit-graph`) and Codex (via `.agents/skills/orbit-graph`) from a single source of truth at `crates/orbit-core/assets/skills/orbit-graph/SKILL.md`. Cross-provider parity is automatic.
A re-run of these 3 fixtures after the skill update would test the affordance fix. Deferred — see the series wrap-up for why we stopped here.
## Failure Taxonomy

No wrong-tool runs (all 9 passed). There were 4 schema-coercion failures across the round, all from `source_regex` calls that hit the bounded-scan rule (a call with neither a prefix nor a non-empty query, combined with a large limit). All were recoverable on retry. The feature introduced no new failure modes.
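The bounded-scan rule can be sketched as a validator (a hypothetical reconstruction from the description above; parameter names and the limit threshold are invented, not the harness's actual values):

```python
from typing import Optional

def check_bounded_scan(prefix: Optional[str], query: Optional[str],
                       limit: int, max_unbounded_limit: int = 100) -> None:
    """Hypothetical sketch: reject a source_regex call that has neither a
    prefix nor a non-empty query but still asks for a large limit."""
    unbounded = not prefix and not query
    if unbounded and limit > max_unbounded_limit:
        raise ValueError(
            "bounded-scan rule: add a prefix or a non-empty query, "
            "or lower the limit")

# A prefixed call is allowed regardless of limit.
check_bounded_scan(prefix="crates/orbit-core/", query=None, limit=1000)

# An unbounded call with a large limit is the shape that failed coercion.
try:
    check_bounded_scan(prefix=None, query="", limit=1000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Under this reading, the 4 failures were agents omitting both scoping parameters while keeping a large limit, which a retry with a prefix resolves.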
## Reproduction

Aggregate against v4 task definitions (v5 reuses them):

```sh
GRAPH_VERSION=v5 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v5/runs \
  --tasks benchmarks/graph/v4/tasks
```

Sweep command that produced this round:

```sh
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider codex --arms graph-only --n 3 \
  --tasks reverse-export-orbit-error const-value-extraction module-surface-orbit-mcp
```

(The sweep targeted v4’s `runs/` directory; transcripts were moved into `v5/runs/` after collection.)
## Status

This is the closing round. See `benchmarks/graph/RESULTS.md` for the cross-round synthesis.