
Graph Token-Usage Benchmark — v5 (Feature-Validation Round)

Mirrored from benchmarks/graph/v5/RESULTS.md. Edit the source document in the repository, not this generated page.

- **Status:** Closed — final round of the v1-v5 series.
- **Scope:** Focused validation of source_regex (the [T20260425-2140] tool change), not a full sweep. Codex only, graph-only arm only, 3 fixtures × 3 seeds = 9 cells.
- **Sweep ID / seed:** `20260425-115117-21a938` / `142643867`
- **Harness SHA:** `56a9c07b...` plus the T20260425-2140 source_regex implementation.
- **Why the small scope:** v4 motivated a specific tool-surface change (source_regex). v5 measures whether that change delivered. A full 192-cell sweep would have re-measured shapes already characterized in v4 without changing the conclusion. See the reading guide in the series wrap-up.


Feature works, agents over-use it. All 9 cells passed, and where the regex is a natural fit the call-count and token wins are large. But on two of three fixtures, agents iterated source_regex 7-19 times per seed (typically with the same regex but narrower prefixes) instead of the predicted 1-2 calls. Tool surface is correct; the affordance is what missed.


Baseline = post-T20260425-0739 Codex graph-only data from v4 (preserved at v4/_archive/codex-graph-only-pre-T20260425-2140/). Post-fix = this round.

| fixture | arm | pass | median tokens | median calls/seed | source_regex calls/seed (post) |
| --- | --- | --- | --- | --- | --- |
| const-value-extraction | baseline | 3/3 | 67,515 | 23 | n/a |
| const-value-extraction | post-fix | 3/3 | 27,175 (-60%) | 5 (-78%) | 2, 3, 5 |
| reverse-export-orbit-error | baseline | 3/3 | 122,948 | 28 | n/a |
| reverse-export-orbit-error | post-fix | 3/3 | 45,907 (-63%) | 22 (-21%) | 15, 19, 7 |
| module-surface-orbit-mcp | baseline | 3/3 | 21,885 | 17 | n/a |
| module-surface-orbit-mcp | post-fix | 3/3 | 24,533 (+12%) | 16 (-6%) | 4, 7, 8 |

const-value-extraction — clean win. The “find every pub const in this module” question maps directly onto source_regex: "^\s*pub\s+const\s+" plus a broad prefix. Agents wrote essentially that, ran it once or twice, and stopped. Acceptance criterion (≤6 calls/seed) met cleanly.
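The winning shape is easy to see in miniature. A minimal sketch of what that regex matches (the sample Rust lines are illustrative, not from the fixture; the real tool runs server-side, not via Python's `re`):

```python
import re

# The regex shape agents converged on for const-value-extraction.
PUB_CONST = re.compile(r"^\s*pub\s+const\s+")

source = """\
pub const MAX_ORBITS: usize = 64;
    pub const EPSILON: f64 = 1e-9;
const private_only: u8 = 0;
fn main() {}
"""

# One broad prefix + this regex answers the fixture in a single call:
# only the two public consts match, private items and code do not.
matches = [line for line in source.splitlines() if PUB_CONST.match(line)]
print(matches)
```

The anchored `^\s*` keeps the match at the start of an item declaration, which is why the false-positive rate stayed low on this fixture.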

reverse-export-orbit-error — half-worked. Tokens fell hard (-63%) because each source_regex call returns a small answer set instead of the 6,300-char pack payloads it replaced. But agents called the same regex (pub\s+use\s+.*OrbitError) 7-19 times per seed, varying the prefix per crate. The tool can answer the question in one call; agents iterated. AC ceiling (≤5 calls/seed) missed. Failure mode is agent affordance, not feature design.

module-surface-orbit-mcp — mismatched shape. The fixture asks “what’s exported from orbit-mcp” — a structural question the regex form doesn’t compress well. Agents tried regexes like pub use|pub trait|pub fn and then had to filter false positives manually. Net token cost rose ~12%. This is the residual case the original kind-taxonomy proposal would have covered; regex isn’t the right tool here.
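The false-positive problem with the alternation is easy to demonstrate. A sketch with illustrative lines (not fixture content) showing why manual filtering was unavoidable:

```python
import re

# The unanchored alternation agents tried for the module-surface question.
SURFACE = re.compile(r"pub use|pub trait|pub fn")

lines = [
    "pub use crate::server::McpServer;",        # genuine export
    "pub trait Tool {",                          # genuine export
    "// a pub fn can also appear in doc text",   # false positive: comment
    "    pub fn call(&self) {}",                 # false positive: method, not module surface
]

# All four lines match, so the regex returns noise the agent must
# filter by hand — the structural "what is exported" question needs
# symbol-kind information, not a textual pattern.
matched = [line for line in lines if SURFACE.search(line)]
print(len(matched))
```

This is the shape argument in code: a line-oriented regex cannot distinguish a crate-root re-export from a method or a comment, which is exactly the gap the kind-taxonomy proposal targeted.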


The agent over-iteration on reverse-export is the load-bearing finding. Two-layer remediation:

  1. orbit-graph skill update (commit 1d306f03) — added a “Source-Regex Enumeration” section with the explicit “ONE broad prefix + ONE regex, do NOT iterate” rule, three worked example shapes, and a verification-loop anti-pattern callout in the Stop Rule. Strategic guidance now lives in the skill, where it belongs, not stuffed into the MCP tool description.
  2. MCP tool descriptions remained terse — the parameter contracts only. Strategy lives in the skill; tool descriptions describe the tool. (See series wrap-up §4 for the broader principle.)
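The skill's "do NOT iterate" rule can also be stated as a mechanical check. A hedged sketch of detecting the verification-loop anti-pattern in a transcript (the call-record schema here is hypothetical, not the harness's real format):

```python
from collections import Counter

# Hypothetical transcript records: same regex, progressively narrower prefixes —
# the exact pattern observed on reverse-export-orbit-error.
calls = [
    {"tool": "source_regex", "regex": r"pub\s+use\s+.*OrbitError", "prefix": "crates/orbit-core"},
    {"tool": "source_regex", "regex": r"pub\s+use\s+.*OrbitError", "prefix": "crates/orbit-mcp"},
    {"tool": "source_regex", "regex": r"pub\s+use\s+.*OrbitError", "prefix": "crates/orbit-cli"},
]

# The skill rule ("ONE broad prefix + ONE regex") implies this list
# should be empty: no regex should ever be issued more than once.
by_regex = Counter(c["regex"] for c in calls if c["tool"] == "source_regex")
iterated = [rx for rx, n in by_regex.items() if n > 1]
print(len(iterated))
```

A check like this could flag affordance regressions automatically in a future sweep, rather than relying on manual transcript review.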

The skill ships to both Claude (via .claude/skills/orbit-graph) and Codex (via .agents/skills/orbit-graph) from a single source-of-truth at crates/orbit-core/assets/skills/orbit-graph/SKILL.md. Cross-provider parity is automatic.
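The shipping step amounts to a fan-out copy. A hedged sketch of the idea (the real build mechanism may differ; only the three paths come from the text, and the demo runs in a scratch directory so it is runnable anywhere):

```python
import shutil
import tempfile
from pathlib import Path

# Canonical source and per-provider targets, as described above.
SOURCE = "crates/orbit-core/assets/skills/orbit-graph/SKILL.md"
TARGETS = [
    ".claude/skills/orbit-graph/SKILL.md",
    ".agents/skills/orbit-graph/SKILL.md",
]

def ship_skill(root: Path) -> list[Path]:
    """Copy the canonical skill file to every provider path under `root`."""
    src = root / SOURCE
    shipped = []
    for rel in TARGETS:
        dst = root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)  # byte-identical: parity is automatic
        shipped.append(dst)
    return shipped

# Demo in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / SOURCE).parent.mkdir(parents=True, exist_ok=True)
(root / SOURCE).write_text("# orbit-graph skill\n")
shipped = ship_skill(root)
print(all(p.read_text() == "# orbit-graph skill\n" for p in shipped))
```

Because both providers receive byte-identical copies of one source file, there is no per-provider drift to audit.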

A re-run of these 3 fixtures after the skill update would test the affordance fix. Deferred — see the series wrap-up for why we stopped here.


No wrong-tool runs (all 9 cells passed). Four schema-coercion failures occurred across the round, all from source_regex calls that hit the bounded-scan rule (calls that supplied neither a prefix nor a non-empty query while requesting a large limit). All were recoverable on retry. No new failure modes were introduced by the feature.
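The bounded-scan rule can be sketched as a simple parameter check. This is an assumed reading of the rule as described above, not the harness's real validation code; the threshold and function name are hypothetical:

```python
# Hypothetical threshold above which an unconstrained scan is refused.
MAX_UNBOUNDED_LIMIT = 100

def validate_source_regex(prefix: str, query: str, limit: int) -> bool:
    """Return True if the call satisfies the bounded-scan rule (sketch)."""
    if limit > MAX_UNBOUNDED_LIMIT and not prefix and not query:
        return False  # unbounded scan: rejected, recoverable on retry
    return True

print(validate_source_regex("", "", limit=500))          # rejected
print(validate_source_regex("crates/", "", limit=500))   # bounded by prefix
print(validate_source_regex("", "OrbitError", limit=500))  # bounded by query
```

Under this reading the four failures were all of the first shape, which is why a retry that adds a prefix succeeds.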


Aggregate against v4 task definitions (v5 reuses them):

```sh
GRAPH_VERSION=v5 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v5/runs \
  --tasks benchmarks/graph/v4/tasks
```

Sweep command that produced this round:

```sh
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider codex --arms graph-only --n 3 \
  --tasks reverse-export-orbit-error const-value-extraction module-surface-orbit-mcp
```

(Sweep targeted v4’s runs/ directory; transcripts moved into v5/runs/ after collection.)


This is the closing round. See benchmarks/graph/RESULTS.md for the cross-round synthesis.