
Graph Token-Usage Benchmark — v3 Results

Mirrored from benchmarks/graph/v3/RESULTS.md. Edit the source document in the repository, not this generated page.

  • Task: v3 graph MCP parity sweep (T20260423-0524)
  • Sweep date: 2026-04-23
  • Sweep IDs: 20260423-194444-3b15a2 (claude), 20260423-210616-9a7e7c (codex)
  • Harness SHA at report write: a37f95cecca7d22711a4b47d9fddb7efed2a0f3b
  • Scope: 2 providers × 3 arms × 10 fixtures × 3 seeds = 180 cells. Three errored cells are excluded from aggregate tables.
  • Providers: Claude (claude-sonnet-4-6) and Codex (gpt-5.3-codex).
  • Fixtures: callers-run-deterministic-containers, deps-orbit-knowledge-consumers, impact-scope-strategy-callsites, impact-tool-context-struct-literals, locate-agentruntime, locate-loopaudit-variants, locate-v2-runtime-host-trait, trace-policy-denial-wiring, trace-tool-call-event-construct-sites, trace-v2runtime-production-impls.


  1. Codex hybrid utilization flipped once MCP parity landed. The v2→v3 graph-use rate moved 0/30 → 23/30. v3 codex hybrid made 79 graph calls.
  2. Claude still did not organically select graph. Claude hybrid made 0 graph calls in 30 runs, even though graph-only proves the tools were available.
  3. Codex accuracy improved under graph access. Codex pass rates were no-graph 28/30 (one failed cell and one errored cell), graph-only 30/30, hybrid 29/30.
  4. Cost remains mixed under the strict per-cell reading. Codex graph-only clears the 1.3× no-graph threshold on 4/10 fixtures; Claude clears it on 1/10.
  5. The firehose problem is real. impact-tool-context-struct-literals is a 12.43× codex graph-only outlier; one passing graph run made 17 pack calls to assemble what no-graph solved far cheaper.

Per-arm counts from the frozen run records’ tool_calls histograms.

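| provider / arm | non-error runs | runs_with_graph_call | graph_calls | shell_or_fs_calls |
|---|---|---|---|---|
| claude / no-graph | 29 | 0 | 0 | 100 |
| claude / hybrid | 30 | 0 | 0 | 93 |
| claude / graph-only | 29 | 29 | 350 | 0 |
| codex / no-graph | 29 | 0 | 0 | 219 |
| codex / hybrid | 30 | 23 | 79 | 148 |
| codex / graph-only | 30 | 30 | 468 | 0 |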
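The per-arm columns above can be derived from each run's tool_calls histogram roughly as follows. This is a minimal sketch, not the harness code: the record layout, the `is_error` field, and the tool name `orbit_graph_pack` are assumptions for illustration. Only the `orbit_graph_` prefix comes from this report.

```python
from collections import Counter

# Assumption: graph tools share the orbit_graph_ prefix named in this report,
# and every other tool counts as a shell/fs call.
GRAPH_PREFIX = "orbit_graph_"


def tally(records):
    """Return {(provider, arm): Counter} holding the per-arm table columns."""
    out = {}
    for rec in records:
        if rec.get("is_error"):
            continue  # errored cells are excluded from aggregate tables
        row = out.setdefault((rec["provider"], rec["arm"]), Counter())
        calls = rec["tool_calls"]
        graph = sum(n for name, n in calls.items() if name.startswith(GRAPH_PREFIX))
        other = sum(n for name, n in calls.items() if not name.startswith(GRAPH_PREFIX))
        row["non_error_runs"] += 1
        row["runs_with_graph_call"] += 1 if graph > 0 else 0
        row["graph_calls"] += graph
        row["shell_or_fs_calls"] += other
    return out


# Hypothetical records, shaped only for this example.
records = [
    {"provider": "codex", "arm": "hybrid", "tool_calls": {"orbit_graph_pack": 2, "shell": 5}},
    {"provider": "codex", "arm": "hybrid", "tool_calls": {"shell": 4}},
    {"provider": "codex", "arm": "hybrid", "is_error": True, "tool_calls": {}},
]
totals = tally(records)
print(dict(totals[("codex", "hybrid")]))
```

Counting runs_with_graph_call separately from graph_calls matters because a single run with many graph calls would otherwise look like broad adoption.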

Codex hybrid graph-call rate by fixture:

| fixture | runs_with_graph_call | graph_calls | pass_rate |
|---|---|---|---|
| callers-run-deterministic-containers | 2/3 | 10 | 3/3 |
| deps-orbit-knowledge-consumers | 3/3 | 3 | 3/3 |
| impact-scope-strategy-callsites | 0/3 | 0 | 3/3 |
| impact-tool-context-struct-literals | 1/3 | 3 | 3/3 |
| locate-agentruntime | 3/3 | 11 | 3/3 |
| locate-loopaudit-variants | 2/3 | 4 | 3/3 |
| locate-v2-runtime-host-trait | 3/3 | 6 | 3/3 |
| trace-policy-denial-wiring | 3/3 | 18 | 3/3 |
| trace-tool-call-event-construct-sites | 3/3 | 16 | 2/3 |
| trace-v2runtime-production-impls | 3/3 | 8 | 3/3 |

The impact-scope-strategy-callsites 0/3 codex-hybrid outlier is not a graph loss: it is a 4-file grep-ergonomic task, and codex passed all three seeds without graph.


Verbatim from:

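GRAPH_VERSION=v3 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v3/runs --tasks benchmarks/graph/v3/tasks

| provider | arm | task_class | runs | pass_rate | median_total_tokens | p90_total_tokens | tokens_per_success | graph_calls | graph_call_rate | shell_or_fs_calls |
|---|---|---|---|---|---|---|---|---|---|---|
| claude | graph-only | deps | 3 | 100% | 1221 | 1550 | 1245 | 17 | 3/3 = 100.0% | 0 |
| claude | graph-only | impact | 6 | 50% | 6022 | 11912 | 14284 | 190 | 6/6 = 100.0% | 0 |
| claude | graph-only | locate | 8 | 100% | 718 | 938 | 742 | 22 | 8/8 = 100.0% | 0 |
| claude | graph-only | trace | 12 | 75% | 3426 | 5691 | 3968 | 121 | 12/12 = 100.0% | 0 |
| claude | hybrid | deps | 3 | 100% | 292 | 559 | 378 | 0 | 0/3 = 0.0% | 3 |
| claude | hybrid | impact | 6 | 100% | 776 | 2298 | 1092 | 0 | 0/6 = 0.0% | 24 |
| claude | hybrid | locate | 9 | 100% | 438 | 909 | 416 | 0 | 0/9 = 0.0% | 18 |
| claude | hybrid | trace | 12 | 83% | 1036 | 2441 | 1494 | 0 | 0/12 = 0.0% | 48 |
| claude | no-graph | deps | 3 | 100% | 315 | 669 | 425 | 0 | 0/3 = 0.0% | 5 |
| claude | no-graph | impact | 6 | 100% | 896 | 2390 | 1124 | 0 | 0/6 = 0.0% | 25 |
| claude | no-graph | locate | 8 | 100% | 449 | 705 | 414 | 0 | 0/8 = 0.0% | 18 |
| claude | no-graph | trace | 12 | 92% | 954 | 2601 | 1359 | 0 | 0/12 = 0.0% | 52 |
| codex | graph-only | deps | 3 | 100% | 10147 | 41682 | 19321 | 38 | 3/3 = 100.0% | 0 |
| codex | graph-only | impact | 6 | 100% | 55376 | 318655 | 114800 | 115 | 6/6 = 100.0% | 0 |
| codex | graph-only | locate | 9 | 100% | 14904 | 28650 | 14801 | 43 | 9/9 = 100.0% | 0 |
| codex | graph-only | trace | 12 | 100% | 40187 | 92250 | 43638 | 272 | 12/12 = 100.0% | 0 |
| codex | hybrid | deps | 3 | 100% | 16138 | 17352 | 16311 | 3 | 3/3 = 100.0% | 14 |
| codex | hybrid | impact | 6 | 100% | 11175 | 33501 | 14987 | 3 | 1/6 = 16.7% | 40 |
| codex | hybrid | locate | 9 | 100% | 14018 | 15737 | 12424 | 21 | 8/9 = 88.9% | 14 |
| codex | hybrid | trace | 12 | 92% | 21232 | 52050 | 25949 | 52 | 11/12 = 91.7% | 80 |
| codex | no-graph | deps | 3 | 100% | 12416 | 12933 | 12516 | 0 | 0/3 = 0.0% | 15 |
| codex | no-graph | impact | 6 | 100% | 15916 | 19787 | 15773 | 0 | 0/6 = 0.0% | 55 |
| codex | no-graph | locate | 9 | 89% | 23253 | 37795 | 25322 | 0 | 0/9 = 0.0% | 54 |
| codex | no-graph | trace | 11 | 100% | 26184 | 37689 | 25940 | 0 | 0/11 = 0.0% | 95 |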

USD totals are available for Claude only. Codex reports $0.0000 because the Codex CLI does not emit billing, not because usage was free.

| provider / arm | cost_usd |
|---|---|
| claude / no-graph | $1.9657 |
| claude / hybrid | $1.9091 |
| claude / graph-only | $4.3260 |
| codex / no-graph | $0.0000 |
| codex / hybrid | $0.0000 |
| codex / graph-only | $0.0000 |

The pre-registered cost criterion is token-based and per-cell: graph-only median (input + output) tokens must be ≤ 1.3× the matching no-graph median for the same provider × fixture cell.

| fixture | codex go/ng | codex ≤1.3× | claude go/ng | claude ≤1.3× |
|---|---|---|---|---|
| deps-orbit-knowledge-consumers | 0.82× | yes | 3.88× | no |
| locate-agentruntime | 0.44× | yes | 1.80× | no |
| locate-v2-runtime-host-trait | 0.18× | yes | 1.57× | no |
| trace-v2runtime-production-impls | 0.58× | yes | 0.77× | yes |
| trace-policy-denial-wiring | 1.74× | no | 1.46× | no |
| locate-loopaudit-variants | 4.87× | no | 3.88× | no |
| callers-run-deterministic-containers | 1.67× | no | 5.10× | no |
| trace-tool-call-event-construct-sites | 2.08× | no | 2.74× | no |
| impact-scope-strategy-callsites | 1.85× | no | 27.18× | no |
| impact-tool-context-struct-literals | 12.43× | no | 2.82× | no |
| per-cell pass count | | 4 / 10 | | 1 / 10 |

Aggregate medians are secondary: codex graph-only is 1.09× no-graph across all fixtures, while Claude graph-only is 1.44×. The strict per-cell reading is the load-bearing result.
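The strict per-cell reading can be checked directly against the go/ng ratios in the per-fixture table. The sketch below only recounts those published ratios; the variable and function names are illustrative and are not taken from the harness.

```python
THRESHOLD = 1.3  # pre-registered ceiling for graph-only / no-graph median tokens

# go/ng ratios copied from the per-fixture table: fixture -> (codex, claude)
RATIOS = {
    "deps-orbit-knowledge-consumers": (0.82, 3.88),
    "locate-agentruntime": (0.44, 1.80),
    "locate-v2-runtime-host-trait": (0.18, 1.57),
    "trace-v2runtime-production-impls": (0.58, 0.77),
    "trace-policy-denial-wiring": (1.74, 1.46),
    "locate-loopaudit-variants": (4.87, 3.88),
    "callers-run-deterministic-containers": (1.67, 5.10),
    "trace-tool-call-event-construct-sites": (2.08, 2.74),
    "impact-scope-strategy-callsites": (1.85, 27.18),
    "impact-tool-context-struct-literals": (12.43, 2.82),
}


def clears(ratio, threshold=THRESHOLD):
    """A cell clears the criterion when graph-only costs at most threshold x no-graph."""
    return ratio <= threshold


codex_pass = sum(clears(codex) for codex, _ in RATIOS.values())
claude_pass = sum(clears(claude) for _, claude in RATIOS.values())
worst_codex = max(RATIOS, key=lambda fixture: RATIOS[fixture][0])

print(f"codex {codex_pass}/10, claude {claude_pass}/10")  # codex 4/10, claude 1/10
print("worst codex cell:", worst_codex)  # impact-tool-context-struct-literals
```

Counting cells rather than comparing one pooled median is what keeps a single very expensive fixture from being averaged away.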


Non-passing cells:

| provider | arm | fixture | seed | verdict | diagnostic |
|---|---|---|---|---|---|
| claude | graph-only | callers-run-deterministic-containers | 2 | fail | oracle rejected final message |
| claude | graph-only | impact-tool-context-struct-literals | 1 | fail | oracle rejected final message |
| claude | graph-only | impact-tool-context-struct-literals | 2 | fail | oracle rejected final message |
| claude | graph-only | impact-tool-context-struct-literals | 3 | fail | oracle rejected final message |
| claude | graph-only | trace-tool-call-event-construct-sites | 1 | fail | oracle rejected final message |
| claude | graph-only | trace-tool-call-event-construct-sites | 2 | fail | oracle rejected final message |
| claude | hybrid | trace-policy-denial-wiring | 2 | fail | oracle rejected final message |
| claude | hybrid | trace-policy-denial-wiring | 3 | fail | oracle rejected final message |
| claude | no-graph | trace-policy-denial-wiring | 1 | fail | oracle rejected final message |
| codex | hybrid | trace-tool-call-event-construct-sites | 2 | fail | oracle rejected final message |
| codex | no-graph | locate-v2-runtime-host-trait | 1 | fail | oracle rejected final message |

Errored cells excluded from aggregate tables:

| provider | arm | fixture | seed | diagnostic |
|---|---|---|---|---|
| claude | no-graph | locate-loopaudit-variants | 2 | claude run reported is_error=True: 529 |
| claude | graph-only | locate-agentruntime | 2 | claude run reported is_error=True: 529 |
| codex | no-graph | trace-policy-denial-wiring | 2 | codex produced no parseable result (exit=124) |

Manual audit found several v3 oracle rejections that are better treated as grader artifacts: the substring oracle rejects answers that mention excluded paths as excluded. v4 replaces that with a structured {"answer": [...], "excluded": [...]} oracle.
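To illustrate the grader artifact, the sketch below contrasts a substring check with a structured check. Both functions and both file paths are hypothetical and are not the actual v3 or v4 graders. Only the {"answer": [...], "excluded": [...]} shape comes from this report.

```python
import json


def substring_oracle(message, required, forbidden):
    """Pass when every required path appears and no forbidden path appears anywhere."""
    return all(p in message for p in required) and not any(p in message for p in forbidden)


def structured_oracle(message, required, forbidden):
    """Pass when the JSON 'answer' list covers required paths and names no forbidden one."""
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    answer = set(payload.get("answer", []))
    return set(required) <= answer and not (answer & set(forbidden))


# Hypothetical paths, used only to show the failure mode.
required = ["src/runtime/agent.rs"]
forbidden = ["tests/fixtures/agent.rs"]

prose = (
    "AgentRuntime is defined in src/runtime/agent.rs. "
    "tests/fixtures/agent.rs is excluded because it is a test double."
)
structured = json.dumps({"answer": required, "excluded": forbidden})

print(substring_oracle(prose, required, forbidden))        # False: excluded path is mentioned
print(structured_oracle(structured, required, forbidden))  # True: it appears only under "excluded"
```

The prose answer is correct but fails the substring check because it names the excluded path while explaining why it is excluded. Separating the two lists removes that ambiguity.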


v3’s METHOD.md pre-registered the cull threshold:

  1. Hybrid utilization ≥ 20% on at least one provider.
  2. Graph-only median (input + output) tokens ≤ 1.3× the matching no-graph median for the same provider × fixture cell.

Criterion 1 passes on a provider-any reading because codex hybrid used graph in 23/30 runs. Claude fails that criterion at 0/30.
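As a worked check of criterion 1, the hybrid utilization rates follow from the run counts reported above. This is a small arithmetic sketch; the variable names are illustrative.

```python
UTILIZATION_FLOOR = 0.20  # criterion 1: at least 20% of hybrid runs use graph

# (runs with at least one graph call, total hybrid runs), from the per-arm table
hybrid = {"codex": (23, 30), "claude": (0, 30)}

rates = {provider: used / runs for provider, (used, runs) in hybrid.items()}
criterion_1 = any(rate >= UTILIZATION_FLOOR for rate in rates.values())

for provider, rate in rates.items():
    print(f"{provider}: {rate:.1%}")  # codex: 76.7%, claude: 0.0%
print("criterion 1 (any provider):", criterion_1)  # True
```

Under an all-providers reading the same numbers would fail, which is why the provider-any wording carries the result.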

Criterion 2 is mixed and does not clear as a clean sweep: codex passes 4/10 fixture cells, Claude passes 1/10. The cost failures cluster on callers, impact, and trace-construction fixtures, which are exactly where the current graph surface either over-includes by signature/name or hands the agent too much payload.

Disposition: retain the agent-facing orbit_graph_* MCP surface, carried primarily by codex utilization and accuracy. This is not a clean benchmark pass; it is a product call that the surface is useful for shell-first providers while still needing payload and precision work.


| hypothesis / threshold | result |
|---|---|
| Codex hybrid 0/30 utilization in v2 was a tool-surface asymmetry. | Supported. MCP parity flips codex hybrid to 23/30 graph-use runs. |
| Claude will organically use graph once v3 closes the cross-provider setup. | Falsified. Claude hybrid remains 0/30. |
| Graph-only cost can stay within 1.3× no-graph per provider × fixture cell. | Mostly falsified. Codex passes 4/10; Claude passes 1/10. |
| Graph access can improve codex accuracy on grep-hard fixtures. | Supported. Codex graph-only is 30/30; no-graph is 28/30. |
| Payload volume is an important remaining failure mode. | Supported. impact-tool-context-struct-literals reaches 12.43× under codex graph-only. |

  1. Keep the MCP graph surface enabled. Codex uses it heavily once offered as first-class MCP tools.
  2. Treat provider behavior as part of the product decision. Claude pays schema/context cost without selecting graph under hybrid; codex gets real navigation value.
  3. Make v4 diagnostic, not keep/cull. v3 already settles retention. v4 should isolate payload firehose, signature-vs-type precision gaps, selector ambiguity, and graph-strength cases.
  4. Replace substring grading. v3 oracle artifacts are noisy enough that v4 needs structured answer/excluded grading.
  5. Measure per-cell, not only aggregate medians. The aggregate codex 1.09× cost hides a 12.43× fixture outlier.

  • Token accounting: median_total_tokens is input_tokens + output_tokens; cached input is reported separately in the secondary aggregate output.
  • Codex billing: Codex cost remains $0.0000 because the provider normalizer has no billing feed.
  • Aggregate reproduction: GRAPH_VERSION=v3 python3 benchmarks/graph/scripts/aggregate.py --runs benchmarks/graph/v3/runs --tasks benchmarks/graph/v3/tasks.
  • Fixture-level utilization table: derived directly from frozen run records’ tool_calls fields.
  • Known caveats carried into v4: structured oracle, per-cell threshold specification, per-tool payload diagnostics, and failure taxonomy are all direct responses to v3’s residual ambiguity.
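The token-accounting note can be sketched as follows. The field names input_tokens and output_tokens come from that note. The `cached_input_tokens` field name and all the numbers are assumptions made up for illustration, not values from the run records.

```python
from statistics import median


def total_tokens(run):
    """Input plus output tokens; cached input is reported separately and not added."""
    return run["input_tokens"] + run["output_tokens"]


def median_total_tokens(runs):
    """Median of per-run totals for one provider x arm x group of runs."""
    return median(total_tokens(run) for run in runs)


# Illustrative runs only; cached_input_tokens is a hypothetical field name.
runs = [
    {"input_tokens": 1000, "output_tokens": 200, "cached_input_tokens": 5000},
    {"input_tokens": 700, "output_tokens": 200, "cached_input_tokens": 0},
    {"input_tokens": 1300, "output_tokens": 200, "cached_input_tokens": 9000},
]
print(median_total_tokens(runs))  # 1200
```

Leaving cached input out of the total means a run with a large cached prefix does not look more expensive than it was on the uncached path.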