# Graph Token-Usage Benchmark — Sweep Results
Mirrored from `benchmarks/graph/v1/RESULTS.md`. Edit the source document in the repository, not this generated page.
Task: T20260422-1609
Sweep date: 2026-04-22
Sweep seed: 1609 (shared across both providers)
Scope: 2 providers × 3 arms × 6 fixtures × 5 seeds = 180 runs.
Fixtures: `locate-agentruntime`, `locate-v2-runtime-host-trait`, `trace-policy-denial-wiring`, `trace-v2runtime-production-impls`, `impact-scope-strategy-callsites`, `deps-orbit-knowledge-consumers`.
The headline finding of this sweep comes from a transcript-level utilization audit layered on top of the raw token aggregates: agents almost never invoke graph tools when they have a choice, so `hybrid` is functionally `no-graph` on this fixture set. Cost tables alone are misleading without this utilization context — read the “Tool-utilization audit” section first.
## Headline

- Graph tools are near-zero-utilization in `hybrid`. Across 60 hybrid runs (30 Claude + 30 Codex), graph tools fired exactly once — a single `orbit_graph_implementors` call on one Claude seed of `locate-agentruntime`. Codex made zero graph-tool calls across 30 hybrid runs.
- Which means: “hybrid beats no-graph” is a mirage. Token parity is not evidence that graph tools help when available; it is evidence that their schema overhead is small enough not to hurt when they are silently ignored in favour of grep / shell.
- Graph-only lifts Codex accuracy but is structurally expensive. Forcing graph-only took Codex’s `locate` pass-rate from 80 % → 100 % and `trace` from 80 % → 100 %, at a 1.2×–2.2× token multiplier and 1.5–3.1 M `cache_read_tokens` per class of MCP schema tax.
- Claude is at the pass-rate ceiling on this fixture set: 119 / 120 passes regardless of arm. The sweep cannot discriminate Claude arms on accuracy — only on cost.
- The practical question is no longer “which arm to default to.” It is “why don’t agents reach for graph tools when they are available, and what would it take for them to do so?”
## Tool-utilization audit (hybrid arm)

Per-transcript counts of `tool_use` blocks in hybrid runs; 5 runs per (provider × fixture) cell.
### Claude / hybrid

| fixture | runs | runs_with_graph_call | graph_calls | Grep | Read | Glob | total_tool_uses |
|---|---|---|---|---|---|---|---|
| deps-orbit-knowledge-consumers | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| impact-scope-strategy-callsites | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| locate-agentruntime | 5 | 1 | 1 | 10 | 0 | 0 | 12 |
| locate-v2-runtime-host-trait | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| trace-policy-denial-wiring | 5 | 0 | 0 | 10 | 13 | 0 | 23 |
| trace-v2runtime-production-impls | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| total | 30 | 1 | 1 | 40 | 13 | 0 | 55 |
The one graph call: `mcp__orbit-bench__orbit_graph_implementors` with `trait_selector=symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait` — a textbook fit for the tool. The other 29 Claude hybrid runs reached straight for Grep and were done in 1–2 tool calls.
### Codex / hybrid

| fixture | runs | runs_with_graph_call | graph_calls | shell_execs | total_tool_uses |
|---|---|---|---|---|---|
| deps-orbit-knowledge-consumers | 5 | 0 | 0 | 31 | 31 |
| impact-scope-strategy-callsites | 5 | 0 | 0 | 16 | 16 |
| locate-agentruntime | 5 | 0 | 0 | 36 | 36 |
| locate-v2-runtime-host-trait | 5 | 0 | 0 | 22 | 22 |
| trace-policy-denial-wiring | 5 | 0 | 0 | 69 | 69 |
| trace-v2runtime-production-impls | 5 | 0 | 0 | 39 | 39 |
| total | 30 | 0 | 0 | 213 | 213 |
Codex made zero MCP graph-tool calls in hybrid. Every hybrid Codex run solved the task with `rg` / `grep` / `find` / `sed` / `cat` via shell `command_execution`.
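The per-transcript counts above come from a scan of the raw transcripts (the scan is detailed in the methodology notes). A minimal sketch of that scan; the exact JSON key carrying the Codex tool name (`tool` below) is an assumption, as is collapsing all graph tools into one bucket:

```python
from collections import Counter

GRAPH_PREFIX = "mcp__orbit-bench__orbit_graph_"

def count_claude_tools(messages: list[dict]) -> Counter:
    """Tally Claude tool_use content blocks by name; graph calls share one bucket."""
    tally = Counter()
    for message in messages:
        for block in message.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                name = block["name"]
                tally["graph" if name.startswith(GRAPH_PREFIX) else name] += 1
    return tally

def count_codex_tools(events: list[dict]) -> Counter:
    """Tally Codex item.completed payloads: mcp_tool_call vs shell command_execution.

    The 'tool' key is hypothetical; the sweep matched these items
    against the substring 'orbit_graph_'.
    """
    tally = Counter()
    for event in events:
        item = event.get("item", {})
        if item.get("type") == "mcp_tool_call":
            is_graph = "orbit_graph_" in str(item.get("tool", ""))
            tally["graph" if is_graph else "other_mcp"] += 1
        elif item.get("type") == "command_execution":
            tally["shell"] += 1
    return tally
```

Summing these counters per (provider × fixture) cell reproduces the tables above.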
## Primary table — provider × arm × task_class

| provider | arm | task_class | runs | pass_rate | median_total_tokens | p90_total_tokens | tokens_per_success |
|---|---|---|---|---|---|---|---|
| claude | graph-only | deps | 5 | 100 % | 622 | 729 | 637 |
| claude | graph-only | impact | 5 | 100 % | 5 089 | 6 307 | 4 767 |
| claude | graph-only | locate | 10 | 100 % | 645 | 797 | 664 |
| claude | graph-only | trace | 10 | 100 % | 1 473 | 3 111 | 1 614 |
| claude | hybrid | deps | 5 | 100 % | 361 | 502 | 373 |
| claude | hybrid | impact | 5 | 100 % | 295 | 338 | 309 |
| claude | hybrid | locate | 10 | 100 % | 336 | 684 | 368 |
| claude | hybrid | trace | 10 | 90 % | 920 | 1 866 | 1 098 |
| claude | no-graph | deps | 5 | 100 % | 285 | 767 | 411 |
| claude | no-graph | impact | 5 | 100 % | 288 | 295 | 290 |
| claude | no-graph | locate | 10 | 100 % | 420 | 479 | 406 |
| claude | no-graph | trace | 10 | 100 % | 1 098 | 2 084 | 1 158 |
| codex | graph-only | deps | 5 | 100 % | 22 615 | 47 941 | 27 506 |
| codex | graph-only | impact | 5 | 100 % | 22 671 | 59 995 | 30 981 |
| codex | graph-only | locate | 10 | 100 % | 17 865 | 48 864 | 21 134 |
| codex | graph-only | trace | 10 | 100 % | 25 332 | 37 392 | 26 151 |
| codex | hybrid | deps | 5 | 100 % | 13 975 | 23 795 | 13 488 |
| codex | hybrid | impact | 5 | 100 % | 12 117 | 24 528 | 16 312 |
| codex | hybrid | locate | 10 | 90 % | 13 014 | 14 938 | 12 916 |
| codex | hybrid | trace | 10 | 100 % | 18 294 | 29 029 | 17 924 |
| codex | no-graph | deps | 5 | 100 % | 12 427 | 13 517 | 12 581 |
| codex | no-graph | impact | 5 | 100 % | 12 776 | 22 289 | 14 904 |
| codex | no-graph | locate | 10 | 80 % | 13 756 | 23 411 | 17 377 |
| codex | no-graph | trace | 10 | 80 % | 15 108 | 30 313 | 18 323 |
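For reference, the per-cell columns can be derived from raw run records roughly as follows. This is a sketch, not the real `aggregate.py` logic: the record schema, the nearest-rank p90, and the definition of `tokens_per_success` (all tokens spent in the cell, amortized over passing runs) are assumptions here:

```python
import math
import statistics

def cell_metrics(runs: list[dict]) -> dict:
    """Summarize one (provider, arm, task_class) cell of the primary table.

    Each run record is assumed to carry 'passed' (bool) and
    'total_tokens' (int); the real schema lives in aggregate.py.
    """
    tokens = sorted(r["total_tokens"] for r in runs)
    passes = sum(1 for r in runs if r["passed"])
    n = len(tokens)
    return {
        "runs": n,
        "pass_rate": passes / n,
        "median_total_tokens": statistics.median(tokens),
        # nearest-rank p90; aggregate.py may interpolate differently
        "p90_total_tokens": tokens[min(n - 1, math.ceil(0.9 * n) - 1)],
        # assumed definition: all tokens spent, amortized over passing runs
        "tokens_per_success": sum(tokens) / passes if passes else math.inf,
    }
```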
## Cost (USD) — Claude only (Codex CLI does not emit billing)

| arm | sonnet cost | haiku cost | arm total |
|---|---|---|---|
| graph-only | $2.7504 | $0.0188 | $2.77 |
| hybrid | $1.3943 | $0.0188 | $1.41 |
| no-graph | $1.4378 | $0.0188 | $1.46 |
Total Claude sweep: ~$5.65. Graph-only was ~1.9× more expensive than the other two arms for zero pass-rate lift on Claude.
## Pass-rate breakdown

240 runs → 234 pass / 6 fail / 0 error.
| run | cause |
|---|---|
| claude / hybrid / trace-policy-denial-wiring / seed=3 | Oracle rejected (stochastic). |
| codex / hybrid / locate-v2-runtime-host-trait / seed=3 | Oracle rejected. |
| codex / no-graph / locate-v2-runtime-host-trait / seeds=2,3 | Oracle rejected (2 / 5). |
| codex / no-graph / trace-policy-denial-wiring / seeds=1,4 | Oracle rejected (2 / 5). |
Codex’s only systemic accuracy gap is no-graph on locate / trace. Graph access (any form) fixes it.
## Re-interpretation: what this sweep actually measured

| apparent effect (from cost tables alone) | real mechanism (once utilization is accounted for) |
|---|---|
| “Hybrid wins on cost.” | Hybrid ≈ no-graph because agents ignore the graph surface and use grep / shell. The 0 % utilization rate is the mechanism. |
| “Claude’s 16× blowup on impact / graph-only.” | Claude is being forced to solve a grep-shaped problem (`ScopeStrategy::` tokens across 4 files) through structural queries. With no grep available, it calls `orbit_graph_search` with noisy payloads and reasons over them. Real. |
| “Codex graph-only lifts locate / trace from 80 % → 100 %.” | Real and reproducible. When agents can’t fall back to grep, they actually use the graph tools, and on the two fixtures where grep is error-prone (`locate-v2-runtime-host-trait` has a filename-collision trap; `trace-policy-denial-wiring` requires distinguishing construction from destructuring), the graph tools fix the accuracy gap. |
| “MCP schema tax is cheap.” | Only because the tools aren’t being used. Once Codex is forced to use them (graph-only), `cache_read_tokens` jump to 1.5–3.1 M per class — an order of magnitude above hybrid. |
## Revised hypothesis reconciliation

- H1 (fixtures are grep-shaped). ✅ Still supported — and now the utilization data directly proves the agents agree: they reach for grep whenever it is available.
- H2 (graph tool payloads are verbose). ✅ Supported where measurable (graph-only only — hybrid doesn’t use them enough to measure).
- H3 (MCP schema tax in context). ✅ Supported under graph-only. Not measurable under hybrid, because the same schema tax appears to be tolerable when the tools are never invoked.
- H4 (per-turn session cost). Entangled with H3; still untested.
- H5 (non-code file scanning hurts graph). ✅ Supported — `impact-scope-strategy-callsites` / graph-only is the single worst cell in the whole sweep on Claude.
- H6 (graph index drift). Not testable from this data.
- H7 (agent over-uses graph). ❌ FALSIFIED. The opposite is true. Agents under-use graph — essentially never reaching for it when grep is available, even when the task structurally favors graph (trait-impl walks, construct-site search).
- H8 (fixture / prompt drift). Partially falsified; one pre-sweep drift in `locate-agentruntime` was caught before the run.
## Revised recommendations

- Stop comparing `hybrid` vs `no-graph` as if it measured graph-tool value. It doesn’t. On this fixture set it measures the size of the tool schema in the system prompt, at zero utilization. To measure graph-tool value you need either (a) fixtures where grep is genuinely the wrong tool, so agents reach for graph on their own, or (b) a `graph-preferred` arm where the system prompt actively instructs the agent to try graph first.
- The real finding of this sweep is the utilization rate: 1 / 60 hybrid runs. The follow-up task is not “tune the token budget” — it is “investigate why agents decline to use graph tools when offered.” Plausible causes to probe: tool descriptions are grep-shaped in the prompt, return payloads are harder to reason over than ripgrep hits, or agents default to the most familiar retrieval surface under uncertainty.
- The one accuracy signal is real: forced graph access fixes Codex’s error-prone locate / trace cases. If we want that lift in production without forcing graph-only, we need agents to choose the graph tools, which today they don’t.
- Future fixtures must make graph tools the obviously-right answer. Candidates: cross-crate trait-impl walks under name collisions, transitive caller queries, refactor-impact across the type graph. These are cases where grep produces ambiguous / noisy results and the graph’s structural index is load-bearing. Only then will utilization rise and the hybrid-vs-no-graph comparison become informative.
- Instrument tool-utilization in the aggregator. This sweep needed an ad-hoc transcript pass to surface the headline finding; `aggregate.py` should emit a per-arm tool-call-mix column so the next sweep reports utilization alongside tokens and pass-rate.
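The tool-call-mix column could look something like the sketch below. The per-run `tool_calls` field (a name → count mapping) is hypothetical — today that data exists only in raw transcripts:

```python
from collections import Counter

def tool_call_mix(runs: list[dict]) -> str:
    """Collapse per-run tool tallies into one compact per-arm mix string.

    Assumes each run record carries a hypothetical 'tool_calls' dict,
    e.g. {"Grep": 2, "graph": 1}; output is sorted by frequency.
    """
    total = Counter()
    for run in runs:
        total.update(run.get("tool_calls", {}))
    return " ".join(f"{name}:{count}" for name, count in total.most_common()) or "none"
```

A cell like Claude / hybrid would then render as something like `Grep:40 Read:13 graph:1`, making the zero-utilization finding visible without a transcript pass.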
## Methodology notes

- Utilization counts were produced by a transcript-level scan of `runs/<provider>/hybrid/<task>/<seed>.transcript.json`. For Claude, each `message.content` block of `type=tool_use` was counted by `name`; graph calls are those with `name.startswith("mcp__orbit-bench__orbit_graph_")`. For Codex, `item.completed` events of type `mcp_tool_call` were matched against `orbit_graph_`; `command_execution` items were counted as shell.
- Token-accounting convention: `input_tokens` is UNCACHED new input across both providers. Codex’s `_normalize_codex_result` subtracts `cached_input_tokens` at the provider boundary so `aggregate.py`’s `input_tokens + output_tokens` column is cross-provider comparable. Regression tests: `scripts/test_providers.py::TestTokenAccountingConvention`.
- Codex’s $0 cost is a CLI limitation, not an omission. All USD numbers are Claude only.
- No `error` verdicts — arm enforcement held across all 240 runs.
- Reproducing utilization data: ad-hoc transcript scan during report generation; not yet merged into `aggregate.py` (see recommendation 5).
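The token-accounting convention can be sketched as a normalization step at the provider boundary. `_normalize_codex_result` is the project’s real function; the body below is an illustrative reconstruction of the convention it enforces, not its actual implementation, and the usage field names are assumed:

```python
def normalize_codex_usage(usage: dict) -> dict:
    """Make Codex usage comparable with Claude's: report only UNCACHED new input.

    Illustrative reconstruction; assumes a Codex usage payload with
    'input_tokens' (cached + uncached) and 'cached_input_tokens'.
    """
    cached = usage.get("cached_input_tokens", 0)
    return {
        "input_tokens": usage["input_tokens"] - cached,  # uncached new input only
        "cache_read_tokens": cached,                     # reported separately
        "output_tokens": usage["output_tokens"],
    }
```

Under this convention, `input_tokens + output_tokens` means the same thing in both provider columns, which is what makes the primary table cross-provider comparable.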