
Graph Token-Usage Benchmark — Sweep Results

Mirrored from benchmarks/graph/v1/RESULTS.md. Edit the source document in the repository, not this generated page.

  • Task: T20260422-1609
  • Sweep date: 2026-04-22
  • Sweep seed: 1609 (shared across both providers)
  • Scope: 2 providers × 3 arms × 6 fixtures × 5 seeds = 180 runs (90 per provider)
  • Fixtures: locate-agentruntime, locate-v2-runtime-host-trait, trace-policy-denial-wiring, trace-v2runtime-production-impls, impact-scope-strategy-callsites, deps-orbit-knowledge-consumers

The headline finding of this sweep comes from a transcript-level utilization audit layered on top of the raw token aggregates: agents almost never invoke graph tools when they have a choice, so hybrid is functionally no-graph on this fixture set. Cost tables alone are misleading without this utilization context — read the “Tool-utilization audit” section first.


  1. Graph tools are near-zero-utilization in hybrid. Across 60 hybrid runs (30 Claude + 30 Codex), graph tools fired exactly once — a single orbit_graph_implementors call on one Claude seed of locate-agentruntime. Codex made zero graph-tool calls across 30 hybrid runs.
  2. Which means: “hybrid beats no-graph” is a mirage. Token parity is not evidence that graph tools help when available; it is evidence that their schema overhead is small enough to not hurt when they are silently ignored in favour of grep / shell.
  3. Graph-only lifts Codex accuracy but is structurally expensive. Forcing graph-only took Codex’s locate pass-rate from 80 % → 100 % and trace from 80 % → 100 %, at a 1.2×–2.2× token multiplier and an MCP schema tax of 1.5–3.1 M cache_read_tokens per task class.
  4. Claude is at the pass-rate ceiling on this fixture set. 89 / 90 passes regardless of arm. The sweep cannot discriminate Claude arms on accuracy — only on cost.
  5. The practical question is no longer “which arm to default to.” It is “why don’t agents reach for graph tools when they are available, and what would it take for them to?”

Tool-utilization audit

Per-transcript counts of tool_use blocks in hybrid runs. 5 runs per (provider × fixture) cell.

Claude (hybrid):

| fixture | runs | runs_with_graph_call | graph_calls | Grep | Read | Glob | total_tool_uses |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deps-orbit-knowledge-consumers | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| impact-scope-strategy-callsites | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| locate-agentruntime | 5 | 1 | 1 | 10 | 0 | 0 | 12 |
| locate-v2-runtime-host-trait | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| trace-policy-denial-wiring | 5 | 0 | 0 | 10 | 13 | 0 | 23 |
| trace-v2runtime-production-impls | 5 | 0 | 0 | 5 | 0 | 0 | 5 |
| total | 30 | 1 | 1 | 40 | 13 | 0 | 55 |

The one graph call: mcp__orbit-bench__orbit_graph_implementors with trait_selector=symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait — a textbook fit for the tool. The other 29 Claude hybrid runs reached straight for Grep and were done in 1–2 tool calls.

Codex (hybrid):

| fixture | runs | runs_with_graph_call | graph_calls | shell_execs | total_tool_uses |
| --- | --- | --- | --- | --- | --- |
| deps-orbit-knowledge-consumers | 5 | 0 | 0 | 31 | 31 |
| impact-scope-strategy-callsites | 5 | 0 | 0 | 16 | 16 |
| locate-agentruntime | 5 | 0 | 0 | 36 | 36 |
| locate-v2-runtime-host-trait | 5 | 0 | 0 | 22 | 22 |
| trace-policy-denial-wiring | 5 | 0 | 0 | 69 | 69 |
| trace-v2runtime-production-impls | 5 | 0 | 0 | 39 | 39 |
| total | 30 | 0 | 0 | 213 | 213 |

Codex made zero MCP graph-tool calls in hybrid. Every hybrid Codex run solved the task with rg / grep / find / sed / cat via shell command_execution.


Primary table — provider × arm × task_class

| provider | arm | task_class | runs | pass_rate | median_total_tokens | p90_total_tokens | tokens_per_success |
| --- | --- | --- | --- | --- | --- | --- | --- |
| claude | graph-only | deps | 5 | 100 % | 622 | 729 | 637 |
| claude | graph-only | impact | 5 | 100 % | 5 089 | 6 307 | 4 767 |
| claude | graph-only | locate | 10 | 100 % | 645 | 797 | 664 |
| claude | graph-only | trace | 10 | 100 % | 1 473 | 3 111 | 1 614 |
| claude | hybrid | deps | 5 | 100 % | 361 | 502 | 373 |
| claude | hybrid | impact | 5 | 100 % | 295 | 338 | 309 |
| claude | hybrid | locate | 10 | 100 % | 336 | 684 | 368 |
| claude | hybrid | trace | 10 | 90 % | 920 | 1 866 | 1 098 |
| claude | no-graph | deps | 5 | 100 % | 285 | 767 | 411 |
| claude | no-graph | impact | 5 | 100 % | 288 | 295 | 290 |
| claude | no-graph | locate | 10 | 100 % | 420 | 479 | 406 |
| claude | no-graph | trace | 10 | 100 % | 1 098 | 2 084 | 1 158 |
| codex | graph-only | deps | 5 | 100 % | 22 615 | 47 941 | 27 506 |
| codex | graph-only | impact | 5 | 100 % | 22 671 | 59 995 | 30 981 |
| codex | graph-only | locate | 10 | 100 % | 17 865 | 48 864 | 21 134 |
| codex | graph-only | trace | 10 | 100 % | 25 332 | 37 392 | 26 151 |
| codex | hybrid | deps | 5 | 100 % | 13 975 | 23 795 | 13 488 |
| codex | hybrid | impact | 5 | 100 % | 12 117 | 24 528 | 16 312 |
| codex | hybrid | locate | 10 | 90 % | 13 014 | 14 938 | 12 916 |
| codex | hybrid | trace | 10 | 100 % | 18 294 | 29 029 | 17 924 |
| codex | no-graph | deps | 5 | 100 % | 12 427 | 13 517 | 12 581 |
| codex | no-graph | impact | 5 | 100 % | 12 776 | 22 289 | 14 904 |
| codex | no-graph | locate | 10 | 80 % | 13 756 | 23 411 | 17 377 |
| codex | no-graph | trace | 10 | 80 % | 15 108 | 30 313 | 18 323 |
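The 1.2×–2.2× multiplier quoted in the key findings can be reproduced from the tokens_per_success column above. A minimal sketch, assuming the multiplier is the per-class graph-only / no-graph ratio for Codex:

```python
# Codex tokens_per_success values, copied from the primary table.
graph_only = {"deps": 27506, "impact": 30981, "locate": 21134, "trace": 26151}
no_graph = {"deps": 12581, "impact": 14904, "locate": 17377, "trace": 18323}

# Per-class cost multiplier of forcing graph-only over no-graph.
multipliers = {cls: round(graph_only[cls] / no_graph[cls], 2) for cls in graph_only}
print(multipliers)  # locate is the cheapest lift, deps the most expensive
```

The range of these ratios (locate at the bottom, deps at the top) is where the headline 1.2×–2.2× figure comes from.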

Cost (USD) — Claude only (Codex CLI does not emit billing)

| arm | sonnet cost | haiku cost | arm total |
| --- | --- | --- | --- |
| graph-only | $2.7504 | $0.0188 | $2.77 |
| hybrid | $1.3943 | $0.0188 | $1.41 |
| no-graph | $1.4378 | $0.0188 | $1.46 |

Total Claude sweep: ~$5.65. Graph-only was ~1.9× more expensive than the other two arms for zero pass-rate lift on Claude.


180 runs → 174 pass / 6 fail / 0 error.

| run | cause |
| --- | --- |
| claude / hybrid / trace-policy-denial-wiring / seed=3 | Oracle rejected (stochastic). |
| codex / hybrid / locate-v2-runtime-host-trait / seed=3 | Oracle rejected. |
| codex / no-graph / locate-v2-runtime-host-trait / seeds=2,3 | Oracle rejected (2 / 5). |
| codex / no-graph / trace-policy-denial-wiring / seeds=1,4 | Oracle rejected (2 / 5). |

Codex’s only systemic accuracy gap is no-graph on locate / trace. Graph access (any form) fixes it.


Re-interpretation: what this sweep actually measured

| apparent effect (from cost tables alone) | real mechanism (once utilization is accounted for) |
| --- | --- |
| “Hybrid wins on cost.” | Hybrid ≈ no-graph because agents ignore the graph surface and use grep / shell. The 0 % utilization rate is the mechanism. |
| “Claude’s 16× blowup on impact/graph-only.” | Claude is being forced to solve a grep-shaped problem (ScopeStrategy:: tokens across 4 files) through structural queries. With no grep available, it calls orbit_graph_search with noisy payloads and reasons over them. Real. |
| “Codex graph-only lifts locate / trace from 80 % → 100 %.” | Real and reproducible. When agents can’t fall back to grep, they actually use the graph tools, and on the two fixtures where grep is error-prone (locate-v2-runtime-host-trait has a filename-collision trap; trace-policy-denial-wiring requires distinguishing construction from destructuring), the graph tools fix the accuracy gap. |
| “MCP schema tax is cheap.” | Only because the tools aren’t being used. Once Codex is forced to use them (graph-only), cache_read_tokens jump to 1.5–3.1 M per class, an order of magnitude above hybrid. |

  • H1 (fixtures are grep-shaped). ✅ Still supported — and now the utilization data directly proves the agents agree: they reach for grep whenever it is available.
  • H2 (graph tool payloads are verbose). ✅ Supported where measurable, i.e. under graph-only; hybrid doesn’t use the tools enough to measure.
  • H3 (MCP schema tax in context). ✅ Supported under graph-only. Not measurable under hybrid, because the same schema tax appears to be tolerable when the tools are never invoked.
  • H4 (per-turn session cost). Entangled with H3; still untested.
  • H5 (non-code file scanning hurts graph). ✅ Supported — impact-scope-strategy-callsites/graph-only is the single worst cell in the whole sweep on Claude.
  • H6 (graph index drift). Not testable from this data.
  • H7 (agent over-uses graph). ❌ FALSIFIED. The opposite is true: agents under-use graph, essentially never reaching for it when grep is available, even when the task structurally favors graph (trait-impl walks, construct-site search).
  • H8 (fixture / prompt drift). Partially falsified; one pre-sweep drift in locate-agentruntime was caught before the run.

  1. Stop comparing hybrid vs no-graph as if it measured graph-tool value. It doesn’t. On this fixture set it measures the size of the tool schema in the system prompt, at zero utilization. To measure graph-tool value you need either (a) fixtures where grep is genuinely the wrong tool so agents reach for graph on their own, or (b) a graph-preferred arm where the system prompt actively instructs the agent to try graph first.
  2. The real finding of this sweep is the utilization rate. 1 / 60 hybrid runs. The follow-up task is not “tune the token budget” — it is “investigate why agents decline to use graph tools when offered.” Plausible causes to probe: tool descriptions are grep-shaped in the prompt, return payloads are harder to reason over than ripgrep hits, or agents default to the most familiar retrieval surface under uncertainty.
  3. The one accuracy signal is real: forced graph access fixes Codex’s error-prone locate / trace cases. If we want that lift in production without forcing graph-only, we need agents to choose the graph tools, which today they don’t.
  4. Future fixtures must make graph tools the obviously-right answer. Candidates: cross-crate trait-impl walks under name collisions, transitive caller queries, refactor-impact across the type graph. These are cases where grep produces ambiguous / noisy results and the graph’s structural index is load-bearing. Only then will utilization rise and the hybrid-vs-no-graph comparison become informative.
  5. Instrument tool-utilization in the aggregator. This sweep needed an ad-hoc transcript pass to surface the headline finding. aggregate.py should emit a per-arm tool-call-mix column so the next sweep reports utilization alongside tokens and pass-rate.
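Recommendation 5 needs very little machinery. The following is a hypothetical sketch (the run-record shape is an assumption, not aggregate.py’s actual row format) of folding per-run tool counters into one compact per-arm column value:

```python
from collections import Counter


def arm_tool_mix(runs: list[dict]) -> str:
    """Fold per-run tool-call counts into one per-arm mix string.

    Each run dict is assumed (hypothetically) to carry a `tool_calls`
    mapping of tool name -> call count for that transcript.
    """
    total = Counter()
    for run in runs:
        total.update(run["tool_calls"])
    # Render most-used tools first, e.g. "Grep:40 Read:13 graph:1".
    return " ".join(f"{name}:{count}" for name, count in total.most_common())
```

Applied to the Claude hybrid totals in the utilization audit, this would render as Grep:40 Read:13 graph:1, making zero-utilization arms visible at a glance in the next sweep’s report.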

  • Utilization counts were produced by a transcript-level scan of runs/<provider>/hybrid/<task>/<seed>.transcript.json. For Claude, each message.content block of type=tool_use was counted by name; graph calls are those with name.startswith("mcp__orbit-bench__orbit_graph_"). For Codex, item.completed events of type mcp_tool_call were matched against orbit_graph_; command_execution items were counted as shell.
  • Token-accounting convention: input_tokens is UNCACHED new input across both providers. Codex _normalize_codex_result subtracts cached_input_tokens at the provider boundary so aggregate.py’s input_tokens + output_tokens column is cross-provider comparable. Regression tests: scripts/test_providers.py::TestTokenAccountingConvention.
  • Codex $0 cost is a CLI limitation, not an omission. All USD numbers are Claude only.
  • No error verdicts — arm enforcement held across all 180 runs.
  • Reproducing utilization data: ad-hoc transcript scan during report generation; not yet merged into aggregate.py (see recommendation 5).
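The ad-hoc transcript scan can be approximated as below. This is a minimal sketch: the event layouts for both providers are assumptions reconstructed from the description in the first bullet above, not a verified transcript schema.

```python
from collections import Counter

GRAPH_PREFIX = "mcp__orbit-bench__orbit_graph_"


def claude_tool_mix(events: list[dict]) -> Counter:
    """Count tool_use content blocks by tool name, bucketing graph tools
    together. Assumes each event wraps a message with typed content blocks."""
    mix: Counter = Counter()
    for event in events:
        for block in event.get("message", {}).get("content") or []:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                name = block["name"]
                mix["graph" if name.startswith(GRAPH_PREFIX) else name] += 1
    return mix


def codex_tool_mix(events: list[dict]) -> Counter:
    """Count item.completed events: MCP graph-tool calls vs shell
    command_execution items (field names here are assumptions)."""
    mix: Counter = Counter()
    for event in events:
        if event.get("type") != "item.completed":
            continue
        item = event.get("item", {})
        if item.get("type") == "mcp_tool_call" and "orbit_graph_" in item.get("tool", ""):
            mix["graph"] += 1
        elif item.get("type") == "command_execution":
            mix["shell"] += 1
    return mix
```

Per recommendation 5, something of this shape belongs in aggregate.py so utilization ships alongside tokens and pass-rate by default.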