Graph Token-Usage Benchmark - v4 Results

Mirrored from benchmarks/graph/v4/RESULTS.md. Edit the source document in the repository, not this generated page.

Status: Complete for Codex and Claude. Codex graph-only rerun post-fix on 2026-04-25 to remove the tool-fix confound.

Sweep dates: 2026-04-24 to 2026-04-25

Codex sweep IDs: 20260424-230632-f36f84 (pre-fix no-graph, graph-only), 20260425-001959-842b8c (pre-fix hybrid), 20260425-115117-21a938 (post-fix graph-only)

Claude sweep IDs: 20260425-013339-cfac6a (no-graph, graph-only), 20260425-012511-8881cb (hybrid)

Codex sweep seeds: 928176111 (pre-fix no-graph, graph-only), 583229300 (pre-fix hybrid), 142643867 (post-fix graph-only)

Claude sweep seeds: 152346771 (no-graph, graph-only), 88160319 (hybrid)

Harness SHAs: Codex pre-fix b0ce189e7053409c8754865bd154cd20e1de66a6; Claude post-fix run / Codex post-fix graph-only rerun 56a9c07b64479360f9a64ca94b40721f76226014 (post-fix branch contains both T20260425-0729 and T20260425-0739).

Scope: 228 cells total. Each provider ran no-graph 12 fixtures x 3 seeds, graph-only 12 fixtures x 3 seeds (Codex's was run twice, pre- and post-fix), and hybrid 8 fixtures x 3 seeds. No errored cells.

Comparison caveat: Codex no-graph and hybrid are pre-fix; Codex graph-only was rerun post-fix and is the canonical Codex-vs-Claude comparison row. Pre-fix Codex graph-only artifacts are preserved at benchmarks/graph/v4/_archive/codex-graph-only-pre-fix-T20260425-0739/ and remain available as a “what the bug looked like” reference. Codex no-graph was not rerun (unaffected by either fix). Codex hybrid was not rerun (passed 24/24 pre-fix; the only delta would be 3 failed graph calls becoming 0).


  1. Codex pre-fix: no-graph passed 36/36, graph-only passed 30/36, and hybrid passed 24/24. All six Codex graph-only failures came from module-surface-orbit-mcp and reverse-export-orbit-error, the two fixtures most exposed to the T20260425-0739 re-export bug.
  2. Codex post-fix (graph-only rerun): graph-only passed 36/36 (was 30/36), median 12,928 tokens (was 15,462), failed graph calls dropped to 4 (was 25). The two formerly-failing fixtures both flipped to 3/3 — confirming T20260425-0739 was the right diagnosis. Schema-coercion churn was eliminated by T20260425-0729.
  3. Claude post-fix: graph-only passed 36/36, hybrid passed 24/24, and no-graph passed 34/36. The two Claude no-graph failures were both const-value-extraction runs that omitted V2_TOOL_WILDCARD_ROOTS.
  4. The fix is symmetric across providers: post-fix, both Codex and Claude graph-only pass reverse-export-orbit-error and module-surface-orbit-mcp 3/3. The Codex-vs-Claude comparison is now clean — no tool-bug confound.
  5. Graph-only accuracy improved post-fix, but cost shifted unevenly: the two formerly-failing fixtures became expensive passes (Codex reverse-export-orbit-error 122,948 tokens vs 13,912 no-graph; module-surface-orbit-mcp 21,885 vs 5,886). Several other fixtures got cheaper because schema-coercion retries are gone. Net: the Codex graph-only median dropped 16% (15,462 to 12,928) while p90 rose 11% (64,877 to 71,774).
  6. Hybrid remains the practical operating mode, but providers route differently: Codex hybrid used graph in 11/24 runs and passed all seeds. Claude hybrid used graph in only 3/24 runs, all on deps-downstream-orbit-knowledge, and passed the rest via shell/source fallback.
  7. The remaining graph work is payload shaping and selector ergonomics: post-fix Codex still hit 4 failed graph calls (2 empty-query searches, 2 unprefixed selectors). Claude post-fix hit 9 (nested-list/invalid-selector shapes). Both classes are recoverable but waste tokens on retry cycles.

Token totals are input_tokens + output_tokens, matching the aggregator’s marginal-token convention. Cached read tokens and Claude USD cost are reported separately by the raw records, but not included in the median-token columns.
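As a rough illustration of that convention, the sketch below totals one run record. Only `input_tokens` and `output_tokens` come from the text above; the record layout and the `cached_read_tokens` and `cost_usd` keys are assumptions made for this example, not the harness's actual schema.

```python
def marginal_tokens(record: dict) -> int:
    """Return input + output tokens, ignoring cached reads and cost."""
    return record["input_tokens"] + record["output_tokens"]


# Hypothetical raw record; the cached/cost key names are illustrative only.
example = {
    "input_tokens": 1200,
    "output_tokens": 300,
    "cached_read_tokens": 42000,  # reported separately, not counted
    "cost_usd": 0.01,             # reported separately, not counted
}

total = marginal_tokens(example)
print(total)  # 1500
```

Under this reading, a run with a large cache hit can still show a small total in the median-token columns.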

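| provider | arm | runs | pass | median_total_tokens | p90_total_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell_or_fs_calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude | no-graph | 36 | 34/36 | 713 | 3159 | 0/36 | 0 | 0 | 154 |
| claude | graph-only | 36 | 36/36 | 2330 | 6866 | 36/36 | 436 | 9 | 0 |
| claude | hybrid | 24 | 24/24 | 663 | 2449 | 3/24 | 3 | 0 | 45 |
| codex | no-graph | 36 | 36/36 | 11446 | 27792 | 0/36 | 0 | 0 | 197 |
| codex | graph-only (pre-fix) | 36 | 30/36 | 15462 | 64877 | 36/36 | 334 | 25 | 0 |
| codex | graph-only (post-fix) | 36 | 36/36 | 12928 | 71774 | 36/36 | 438 | 4 | 0 |
| codex | hybrid | 24 | 24/24 | 3900 | 11048 | 11/24 | 40 | 3 | 61 |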

Hybrid only ran on the 8 graph-strength and precision-gap fixtures, per METHOD.md. On that same 24-run subset:

| provider | no-graph median | graph-only median | hybrid median |
| --- | --- | --- | --- |
| claude | 491 | 1541 | 663 |
| codex | 11446 | 15114 | 3900 |

Verbatim from:

```bash
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v4/runs \
  --tasks benchmarks/graph/v4/tasks
```

| provider | arm | task_class | runs | pass_rate | median_total_tokens | p90_total_tokens | tokens_per_success | graph_calls | graph_call_rate | shell_or_fs_calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude | graph-only | graph-strength | 12 | 100% | 1310 | 3944 | 1871 | 77 | 12/12 = 100.0% | 0 |
| claude | graph-only | payload-volume | 6 | 100% | 5128 | 11584 | 5636 | 201 | 6/6 = 100.0% | 0 |
| claude | graph-only | precision-gap | 12 | 100% | 1854 | 4538 | 2252 | 90 | 12/12 = 100.0% | 0 |
| claude | graph-only | selector-ambiguity | 6 | 100% | 3960 | 7645 | 4308 | 68 | 6/6 = 100.0% | 0 |
| claude | hybrid | graph-strength | 12 | 100% | 669 | 1842 | 818 | 3 | 3/12 = 25.0% | 12 |
| claude | hybrid | precision-gap | 12 | 100% | 468 | 2887 | 935 | 0 | 0/12 = 0.0% | 33 |
| claude | no-graph | graph-strength | 12 | 100% | 536 | 1841 | 737 | 0 | 0/12 = 0.0% | 31 |
| claude | no-graph | payload-volume | 6 | 67% | 1517 | 2508 | 2207 | 0 | 0/6 = 0.0% | 34 |
| claude | no-graph | precision-gap | 12 | 100% | 343 | 3052 | 874 | 0 | 0/12 = 0.0% | 30 |
| claude | no-graph | selector-ambiguity | 6 | 100% | 2677 | 5296 | 3188 | 0 | 0/6 = 0.0% | 59 |
| codex | graph-only (pre-fix) | graph-strength | 12 | 75% | 9352 | 78425 | 33907 | 115 | 12/12 = 100.0% | 0 |
| codex | graph-only (pre-fix) | payload-volume | 6 | 100% | 16855 | 101587 | 29067 | 57 | 6/6 = 100.0% | 0 |
| codex | graph-only (pre-fix) | precision-gap | 12 | 100% | 15450 | 24307 | 14631 | 91 | 12/12 = 100.0% | 0 |
| codex | graph-only (pre-fix) | selector-ambiguity | 6 | 50% | 21961 | 37822 | 44688 | 71 | 6/6 = 100.0% | 0 |
| codex | graph-only (post-fix) | graph-strength | 12 | 100% | 8822 | 139043 | 36989 | 153 | 12/12 = 100.0% | 0 |
| codex | graph-only (post-fix) | payload-volume | 6 | 100% | 41726 | 76033 | 41984 | 106 | 6/6 = 100.0% | 0 |
| codex | graph-only (post-fix) | precision-gap | 12 | 100% | 8813 | 35578 | 12930 | 107 | 12/12 = 100.0% | 0 |
| codex | graph-only (post-fix) | selector-ambiguity | 6 | 100% | 26106 | 32152 | 24042 | 72 | 6/6 = 100.0% | 0 |
| codex | hybrid | graph-strength | 12 | 100% | 4232 | 14417 | 5394 | 26 | 9/12 = 75.0% | 23 |
| codex | hybrid | precision-gap | 12 | 100% | 3346 | 7607 | 3871 | 14 | 2/12 = 16.7% | 38 |
| codex | no-graph | graph-strength | 12 | 100% | 13154 | 26230 | 14676 | 0 | 0/12 = 0.0% | 67 |
| codex | no-graph | payload-volume | 6 | 100% | 12434 | 24527 | 13374 | 0 | 0/6 = 0.0% | 36 |
| codex | no-graph | precision-gap | 12 | 100% | 11169 | 15969 | 8733 | 0 | 0/12 = 0.0% | 42 |
| codex | no-graph | selector-ambiguity | 6 | 100% | 19298 | 51473 | 21861 | 0 | 0/6 = 0.0% | 52 |

go/ng is the ratio of graph-only median tokens to no-graph median tokens. median go/ng compares aggregate category medians. mean fixture go/ng and worst fixture go/ng are computed from per-fixture ratios and are the load-bearing cost readings. Lower ratios are better for graph-only.

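| provider | category | no-graph pass | graph-only pass | pass delta | median go/ng | mean fixture go/ng | worst fixture go/ng | hybrid graph rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude | graph-strength | 12/12 | 12/12 | +0 | 2.44x | 3.72x | 6.73x | 3/12 |
| claude | precision-gap | 12/12 | 12/12 | +0 | 5.41x | 4.65x | 7.61x | 0/12 |
| claude | payload-volume | 4/6 | 6/6 | +2 | 3.38x | 5.68x | 9.71x | n/a |
| claude | selector-ambiguity | 6/6 | 6/6 | +0 | 1.48x | 1.42x | 1.94x | n/a |
| codex (pre-fix) | graph-strength | 12/12 | 9/12 | -3 | 0.71x | 1.70x | 5.59x | 9/12 |
| codex (pre-fix) | precision-gap | 12/12 | 12/12 | +0 | 1.38x | 2.15x | 5.41x | 2/12 |
| codex (pre-fix) | payload-volume | 6/6 | 6/6 | +0 | 1.36x | 1.87x | 3.25x | n/a |
| codex (pre-fix) | selector-ambiguity | 6/6 | 3/6 | -3 | 1.14x | 1.45x | 1.89x | n/a |
| codex (post-fix) | graph-strength | 12/12 | 12/12 | +0 | 0.67x | 2.61x | 8.84x | 9/12 |
| codex (post-fix) | precision-gap | 12/12 | 12/12 | +0 | 0.79x | 1.53x | 2.99x | 2/12 |
| codex (post-fix) | payload-volume | 6/6 | 6/6 | +0 | 3.36x | 4.27x | 7.78x | n/a |
| codex (post-fix) | selector-ambiguity | 6/6 | 6/6 | +0 | 1.35x | 2.29x | 3.72x | n/a |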
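
The three go/ng readings can be reproduced by hand for one row. The sketch below does this for Claude graph-strength, using the fixture medians and category medians reported elsewhere in this document. It assumes, based on the numbers lining up, that the mean and worst readings are taken over per-fixture median ratios; the variable names are illustrative and are not taken from aggregate.py.

```python
from statistics import mean

# Claude graph-strength fixture medians as (no-graph, graph-only),
# copied from the per-fixture results in this document.
fixtures = {
    "callers-2hop-graphbenchpolicy": (535, 1016),
    "deps-downstream-orbit-knowledge": (1610, 1338),
    "implementors-benchsink-with-blanket": (184, 998),
    "reverse-export-orbit-error": (537, 3613),
}

# Per-fixture go/ng: graph-only median divided by no-graph median.
ratios = {name: go / ng for name, (ng, go) in fixtures.items()}
mean_fixture = mean(ratios.values())
worst_fixture = max(ratios.values())

# Category-level medians for the same class (graph-only 1310, no-graph 536).
median_go_ng = 1310 / 536

print(f"median go/ng        {median_go_ng:.2f}x")   # 2.44x
print(f"mean fixture go/ng  {mean_fixture:.2f}x")   # 3.72x
print(f"worst fixture go/ng {worst_fixture:.2f}x")  # 6.73x
```

The gap between 2.44x and 6.73x in this one row is why the per-fixture readings carry more weight than the aggregate median.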

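| provider | mode | arm | runs | pass | median_total_tokens | graph_call_rate | graph_calls | failed_graph_calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude | production | no-graph | 21 | 19/21 | 1916 | 0/21 | 0 | 0 |
| claude | production | graph-only | 21 | 21/21 | 3720 | 21/21 | 357 | 6 |
| claude | production | hybrid | 9 | 9/9 | 1735 | 3/9 | 3 | 0 |
| claude | synthetic | no-graph | 15 | 15/15 | 219 | 0/15 | 0 | 0 |
| claude | synthetic | graph-only | 15 | 15/15 | 1283 | 15/15 | 79 | 3 |
| claude | synthetic | hybrid | 15 | 15/15 | 318 | 0/15 | 0 | 0 |
| codex | production | no-graph | 21 | 21/21 | 14162 | 0/21 | 0 | 0 |
| codex | production | graph-only (pre-fix) | 21 | 15/21 | 17847 | 21/21 | 222 | 18 |
| codex | production | graph-only (post-fix) | 21 | 21/21 | 24522 | 21/21 | 295 | 4 |
| codex | production | hybrid | 9 | 9/9 | 4872 | 3/9 | 3 | 0 |
| codex | synthetic | no-graph | 15 | 15/15 | 11278 | 0/15 | 0 | 0 |
| codex | synthetic | graph-only (pre-fix) | 15 | 15/15 | 14972 | 15/15 | 112 | 7 |
| codex | synthetic | graph-only (post-fix) | 15 | 15/15 | 8429 | 15/15 | 143 | 0 |
| codex | synthetic | hybrid | 15 | 15/15 | 3819 | 8/15 | 37 | 3 |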

The production split is the load-bearing product signal. Pre-fix, Codex graph-only lost accuracy only on production-grounded fixtures. Post-fix, Codex graph-only passes all 36 cells but at higher production-side cost (24,522 median vs 14,162 no-graph) — the cost moved into the two formerly-failing fixtures, which now pass but at 8.84× and 3.72× their no-graph medians. Claude graph-only passed all production fixtures post-fix; Claude no-graph missed const-value-extraction twice.


The post-fix Codex graph-only rerun isolates the joint effect of T20260425-0729 (string-list coercion) and T20260425-0739 (pub-use re-export indexing). Identical fixture set, identical n=3 seeds-per-cell, identical provider/model (gpt-5.3-codex); only the harness SHA and graph index differ.

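| metric | pre-fix | post-fix | delta |
| --- | --- | --- | --- |
| graph-only pass rate | 30/36 | 36/36 | +6 |
| graph-only median tokens | 15,462 | 12,928 | -16% |
| graph-only p90 tokens | 64,877 | 71,774 | +11% |
| total graph calls | 334 | 438 | +31% |
| failed graph calls | 25 | 4 | -84% |
| schema-coercion failures | 26 | 0 | -100% |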

The schema-coercion class (refs.include must be array, pack.selectors must be array) is fully resolved. Total graph calls went up because Codex now successfully completes calls that previously failed and forced retries; more useful tool calls produce more downstream calls. p90 went up because the two formerly-failing fixtures now pass at high cost rather than giving up early.
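
The percentage deltas follow from the pre-fix and post-fix values with ordinary relative-change arithmetic. A minimal sketch, assuming rounding to the nearest whole percent (the helper name `pct_delta` is made up for this example):

```python
def pct_delta(pre: float, post: float) -> int:
    """Percentage change from pre to post, rounded to a whole percent."""
    return round((post - pre) / pre * 100)


# (pre-fix, post-fix) values for the Codex graph-only arm.
metrics = {
    "graph-only median tokens": (15462, 12928),
    "graph-only p90 tokens": (64877, 71774),
    "total graph calls": (334, 438),
    "failed graph calls": (25, 4),
}

deltas = {name: pct_delta(pre, post) for name, (pre, post) in metrics.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
# graph-only median tokens: -16%
# graph-only p90 tokens: +11%
# total graph calls: +31%
# failed graph calls: -84%
```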

| fixture | class | pre-fix pass | post-fix pass | pre-fix median | post-fix median | post/pre ratio |
| --- | --- | --- | --- | --- | --- | --- |
| reverse-export-orbit-error | graph-strength | 0/3 | 3/3 | 77,792 | 122,948 | 1.58x |
| module-surface-orbit-mcp | selector-ambiguity | 0/3 | 3/3 | 11,134 | 21,885 | 1.97x |
| const-value-extraction | payload-volume | 3/3 | 3/3 | 28,222 | 67,515 | 2.39x |
| generic-dispatch-concrete-impl | precision-gap | 3/3 | 3/3 | 15,741 | 20,293 | 1.29x |
| impl-divergence-trait-method | payload-volume | 3/3 | 3/3 | 8,081 | 12,646 | 1.56x |
| callers-2hop-graphbenchpolicy | graph-strength | 3/3 | 3/3 | 5,444 | 8,532 | 1.57x |
| deps-downstream-orbit-knowledge | graph-strength | 3/3 | 3/3 | 2,250 | 2,833 | 1.26x |
| implementors-benchsink-with-blanket | graph-strength | 3/3 | 3/3 | 7,370 | 8,705 | 1.18x |
| references-vs-callers-tool-registry-register | selector-ambiguity | 3/3 | 3/3 | 32,146 | 27,691 | 0.86x |
| function-as-value-vs-direct-call | precision-gap | 3/3 | 3/3 | 16,761 | 13,210 | 0.79x |
| macro-expanded-callers | precision-gap | 3/3 | 3/3 | 6,605 | 4,368 | 0.66x |
| construct-vs-match-benchevent-distinct | precision-gap | 3/3 | 3/3 | 15,257 | 8,429 | 0.55x |

The two formerly-failing fixtures pass at 1.58–1.97× their pre-fix cost. Pre-fix Codex was bailing early into a wrong answer once graph confidently lied ([] for reverse-export-orbit-error, an incomplete surface for module-surface-orbit-mcp), so the pre-fix token count was an underestimate of “what it actually takes to answer this question with the graph.” Post-fix, Codex does the real work and we see the real cost.

Of the ten fixtures that passed both pre- and post-fix, four got cheaper post-fix (function-as-value, macro-expanded, construct-vs-match, references-vs-callers); these are cells where the schema-coercion friction was the dominant pre-fix overhead. The other six got more expensive, five of them by a modest amount (1.18× to 1.57×). const-value-extraction is the largest “passed-then-passed-more-expensively” gap (2.39×) and probably reflects a richer post-fix index returning more candidates that Codex enumerates through.


| fixture | class | mode | arm | pass | median_tokens | p90_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell/fs_calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | no-graph | 3/3 | 535 | 746 | 0/3 | 0 | 0 | 3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only | 3/3 | 1016 | 1283 | 3/3 | 6 | 0 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | hybrid | 3/3 | 750 | 855 | 0/3 | 0 | 0 | 3 |
| const-value-extraction | payload-volume | production | no-graph | 1/3 | 719 | 1198 | 0/3 | 0 | 0 | 14 |
| const-value-extraction | payload-volume | production | graph-only | 3/3 | 6979 | 11584 | 3/3 | 158 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | no-graph | 3/3 | 481 | 699 | 0/3 | 0 | 0 | 3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only | 3/3 | 2290 | 2370 | 3/3 | 30 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | hybrid | 3/3 | 719 | 732 | 0/3 | 0 | 0 | 3 |
| deps-downstream-orbit-knowledge | graph-strength | production | no-graph | 3/3 | 1610 | 1940 | 0/3 | 0 | 0 | 19 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only | 3/3 | 1338 | 1959 | 3/3 | 3 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | hybrid | 3/3 | 1735 | 1888 | 3/3 | 3 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | no-graph | 3/3 | 2308 | 3371 | 0/3 | 0 | 0 | 21 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only | 3/3 | 4273 | 4652 | 3/3 | 24 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | hybrid | 3/3 | 2823 | 2915 | 0/3 | 0 | 0 | 24 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | no-graph | 3/3 | 197 | 219 | 0/3 | 0 | 0 | 3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only | 3/3 | 864 | 1196 | 3/3 | 13 | 0 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | hybrid | 3/3 | 219 | 239 | 0/3 | 0 | 0 | 3 |
| impl-divergence-trait-method | payload-volume | production | no-graph | 3/3 | 2004 | 2508 | 0/3 | 0 | 0 | 20 |
| impl-divergence-trait-method | payload-volume | production | graph-only | 3/3 | 3332 | 3438 | 3/3 | 43 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | no-graph | 3/3 | 184 | 184 | 0/3 | 0 | 0 | 3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only | 3/3 | 998 | 1537 | 3/3 | 7 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | hybrid | 3/3 | 318 | 327 | 0/3 | 0 | 0 | 3 |
| macro-expanded-callers | precision-gap | synthetic | no-graph | 3/3 | 203 | 231 | 0/3 | 0 | 0 | 3 |
| macro-expanded-callers | precision-gap | synthetic | graph-only | 3/3 | 1545 | 1558 | 3/3 | 23 | 3 | 0 |
| macro-expanded-callers | precision-gap | synthetic | hybrid | 3/3 | 203 | 219 | 0/3 | 0 | 0 | 3 |
| module-surface-orbit-mcp | selector-ambiguity | production | no-graph | 3/3 | 1916 | 2285 | 0/3 | 0 | 0 | 10 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only | 3/3 | 3720 | 4383 | 3/3 | 28 | 1 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | no-graph | 3/3 | 4698 | 5296 | 0/3 | 0 | 0 | 49 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only | 3/3 | 4201 | 7645 | 3/3 | 40 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | no-graph | 3/3 | 537 | 707 | 0/3 | 0 | 0 | 6 |
| reverse-export-orbit-error | graph-strength | production | graph-only | 3/3 | 3613 | 4086 | 3/3 | 61 | 4 | 0 |
| reverse-export-orbit-error | graph-strength | production | hybrid | 3/3 | 588 | 629 | 0/3 | 0 | 0 | 6 |

The graph-only (pre-fix) rows are retained from the original Codex sweep. The graph-only (post-fix) rows are from the rerun at harness SHA 56a9c07b... after both T20260425-0729 and T20260425-0739 landed.

| fixture | class | mode | arm | pass | median_tokens | p90_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell/fs_calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | no-graph | 3/3 | 12397 | 23855 | 0/3 | 0 | 0 | 10 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only (pre-fix) | 3/3 | 5444 | 15280 | 3/3 | 16 | 1 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only (post-fix) | 3/3 | 8532 | 8858 | 3/3 | 28 | 0 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | hybrid | 3/3 | 5495 | 5891 | 3/3 | 14 | 0 | 1 |
| const-value-extraction | payload-volume | production | no-graph | 3/3 | 8680 | 9445 | 0/3 | 0 | 0 | 21 |
| const-value-extraction | payload-volume | production | graph-only (pre-fix) | 3/3 | 28222 | 101587 | 3/3 | 33 | 2 | 0 |
| const-value-extraction | payload-volume | production | graph-only (post-fix) | 3/3 | 67515 | 74329 | 3/3 | 61 | 2 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | no-graph | 3/3 | 2818 | 12423 | 0/3 | 0 | 0 | 8 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 15257 | 17551 | 3/3 | 20 | 1 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 8429 | 15958 | 3/3 | 22 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | hybrid | 3/3 | 2873 | 4957 | 0/3 | 0 | 0 | 6 |
| deps-downstream-orbit-knowledge | graph-strength | production | no-graph | 3/3 | 16379 | 25955 | 0/3 | 0 | 0 | 30 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only (pre-fix) | 3/3 | 2250 | 11335 | 3/3 | 6 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only (post-fix) | 3/3 | 2833 | 11605 | 3/3 | 6 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | hybrid | 3/3 | 1383 | 1449 | 3/3 | 3 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | no-graph | 3/3 | 14162 | 16744 | 0/3 | 0 | 0 | 23 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only (pre-fix) | 3/3 | 16761 | 23051 | 3/3 | 27 | 3 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only (post-fix) | 3/3 | 13210 | 17944 | 3/3 | 26 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | hybrid | 3/3 | 4874 | 8743 | 0/3 | 0 | 0 | 21 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | no-graph | 3/3 | 11132 | 11207 | 0/3 | 0 | 0 | 4 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 15741 | 24846 | 3/3 | 22 | 1 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 20293 | 37761 | 3/3 | 39 | 0 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | hybrid | 3/3 | 3819 | 4741 | 2/3 | 14 | 1 | 4 |
| impl-divergence-trait-method | payload-volume | production | no-graph | 3/3 | 16776 | 24527 | 0/3 | 0 | 0 | 15 |
| impl-divergence-trait-method | payload-volume | production | graph-only (pre-fix) | 3/3 | 8081 | 17847 | 3/3 | 24 | 1 | 0 |
| impl-divergence-trait-method | payload-volume | production | graph-only (post-fix) | 3/3 | 12646 | 22257 | 3/3 | 45 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | no-graph | 3/3 | 11469 | 22553 | 0/3 | 0 | 0 | 7 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only (pre-fix) | 3/3 | 7370 | 34916 | 3/3 | 32 | 2 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only (post-fix) | 3/3 | 8705 | 10149 | 3/3 | 34 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | hybrid | 3/3 | 3981 | 14873 | 3/3 | 9 | 2 | 7 |
| macro-expanded-callers | precision-gap | synthetic | no-graph | 3/3 | 11278 | 11512 | 0/3 | 0 | 0 | 7 |
| macro-expanded-callers | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 6605 | 14098 | 3/3 | 22 | 2 | 0 |
| macro-expanded-callers | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 4368 | 4606 | 3/3 | 20 | 0 | 0 |
| macro-expanded-callers | precision-gap | synthetic | hybrid | 3/3 | 2288 | 2485 | 0/3 | 0 | 0 | 7 |
| module-surface-orbit-mcp | selector-ambiguity | production | no-graph | 3/3 | 5886 | 7437 | 0/3 | 0 | 0 | 16 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only (pre-fix) | 0/3 | 11134 | 19582 | 3/3 | 42 | 5 | 0 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only (post-fix) | 3/3 | 21885 | 26781 | 3/3 | 41 | 0 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | no-graph | 3/3 | 32141 | 51473 | 0/3 | 0 | 0 | 36 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only (pre-fix) | 3/3 | 32146 | 37822 | 3/3 | 29 | 4 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only (post-fix) | 3/3 | 27691 | 31259 | 3/3 | 31 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | no-graph | 3/3 | 13912 | 26349 | 0/3 | 0 | 0 | 20 |
| reverse-export-orbit-error | graph-strength | production | graph-only (pre-fix) | 0/3 | 77792 | 78697 | 3/3 | 61 | 3 | 0 |
| reverse-export-orbit-error | graph-strength | production | graph-only (post-fix) | 3/3 | 122948 | 141342 | 3/3 | 85 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | hybrid | 3/3 | 5830 | 13354 | 0/3 | 0 | 0 | 15 |

| fixture | pass | median_tokens | graph_call_rate | graph_calls | shell/fs_calls | interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| callers-2hop-graphbenchpolicy | 3/3 | 750 | 0/3 | 0 | 3 | Passed by shell/source fallback; graph avoided organically. |
| construct-vs-match-benchevent-distinct | 3/3 | 719 | 0/3 | 0 | 3 | Passed by shell/source fallback; graph avoided organically. |
| deps-downstream-orbit-knowledge | 3/3 | 1735 | 3/3 | 3 | 0 | Only Claude hybrid fixture that used graph; direct deps solved it cleanly. |
| function-as-value-vs-direct-call | 3/3 | 2823 | 0/3 | 0 | 24 | Passed by source fallback with relatively heavy shell/read use. |
| generic-dispatch-concrete-impl | 3/3 | 219 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| implementors-benchsink-with-blanket | 3/3 | 318 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| macro-expanded-callers | 3/3 | 203 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| reverse-export-orbit-error | 3/3 | 588 | 0/3 | 0 | 6 | Passed by shell/source fallback despite graph-only success post-fix. |

Claude hybrid shows that neutral hybrid prompting does not guarantee graph selection. It mostly measures whether Claude can route to the cheapest available source strategy, and for this fixture set that was usually shell/source reading rather than graph.

| fixture | pass | median_tokens | graph_call_rate | graph_calls | shell/fs_calls | interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| callers-2hop-graphbenchpolicy | 3/3 | 5495 | 3/3 | 14 | 1 | Used graph consistently; stayed well below no-graph. |
| construct-vs-match-benchevent-distinct | 3/3 | 2873 | 0/3 | 0 | 6 | Passed by shell fallback; graph avoided organically. |
| deps-downstream-orbit-knowledge | 3/3 | 1383 | 3/3 | 3 | 0 | Best graph-shaped win; deps solved directly. |
| function-as-value-vs-direct-call | 3/3 | 4874 | 0/3 | 0 | 21 | Passed by shell fallback; graph avoided organically. |
| generic-dispatch-concrete-impl | 3/3 | 3819 | 2/3 | 14 | 4 | Mixed graph use; source reading did the final disambiguation. |
| implementors-benchsink-with-blanket | 3/3 | 3981 | 3/3 | 9 | 7 | Used graph, then shell/source checks; cheaper than both baselines. |
| macro-expanded-callers | 3/3 | 2288 | 0/3 | 0 | 7 | Passed by shell fallback; graph avoided organically. |
| reverse-export-orbit-error | 3/3 | 5830 | 0/3 | 0 | 15 | Passed by shell fallback; pre-fix graph-only failed all seeds (post-fix graph-only passes 3/3). |

Codex hybrid’s 24/24 pass rate is not proof that every hybrid-eligible fixture is graph-shaped. It is proof that Codex can route around graph gaps when shell/source tools are available.


| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
| --- | --- | --- | --- | --- | --- | --- |
| deps-downstream-orbit-knowledge | graph-strength | production | 1610 | 1338 | 0.83x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 4698 | 4201 | 0.89x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 2004 | 3332 | 1.66x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 2308 | 4273 | 1.85x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 535 | 1016 | 1.90x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 1916 | 3720 | 1.94x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 197 | 864 | 4.39x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 481 | 2290 | 4.76x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 184 | 998 | 5.42x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 537 | 3613 | 6.73x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 203 | 1545 | 7.61x | 3/3 |
| const-value-extraction | payload-volume | production | 719 | 6979 | 9.71x | 3/3 |

Claude graph-only was excellent for accuracy after the graph fixes, but it was rarely the cheapest route. The const-value-extraction cell is the sharpest tradeoff: graph-only found the full set in every seed, while no-graph missed one constant twice, but graph-only used 9.71x the no-graph median tokens.

Codex graph-only (pre-fix):

| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
| --- | --- | --- | --- | --- | --- | --- |
| deps-downstream-orbit-knowledge | graph-strength | production | 16379 | 2250 | 0.14x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 12397 | 5444 | 0.44x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 16776 | 8081 | 0.48x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 11278 | 6605 | 0.59x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 11469 | 7370 | 0.64x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 32141 | 32146 | 1.00x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 14162 | 16761 | 1.18x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 11132 | 15741 | 1.41x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 5886 | 11134 | 1.89x | 0/3 |
| const-value-extraction | payload-volume | production | 8680 | 28222 | 3.25x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 2818 | 15257 | 5.41x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 13912 | 77792 | 5.59x | 0/3 |

Codex graph-only (post-fix):

| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
| --- | --- | --- | --- | --- | --- | --- |
| deps-downstream-orbit-knowledge | graph-strength | production | 16379 | 2833 | 0.17x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 11278 | 4368 | 0.39x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 12397 | 8532 | 0.69x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 16776 | 12646 | 0.75x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 11469 | 8705 | 0.76x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 32141 | 27691 | 0.86x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 14162 | 13210 | 0.93x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 11132 | 20293 | 1.82x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 2818 | 8429 | 2.99x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 5886 | 21885 | 3.72x | 3/3 |
| const-value-extraction | payload-volume | production | 8680 | 67515 | 7.78x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 13912 | 122948 | 8.84x | 3/3 |

Post-fix Codex graph-only beats no-graph on 7/12 fixtures (vs 5/12 pre-fix). The losses are concentrated in two fixtures where pre-fix Codex was bailing early into wrong answers (reverse-export, module-surface); post-fix it does the real work and the cost surfaces. The third large outlier is const-value-extraction (7.78x) — a payload-volume fixture where graph enumeration is structurally complete but expensive on a value-extraction task.
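
The 7/12 and 5/12 counts can be checked directly from the Codex fixture medians. The sketch below is only a recount of numbers already reported in this document; "beats" is taken to mean a graph-only median strictly below the no-graph median, which is an assumption about how the tally was made.

```python
# Codex median tokens per fixture: (no-graph, graph-only pre-fix, graph-only post-fix).
codex_medians = {
    "deps-downstream-orbit-knowledge": (16379, 2250, 2833),
    "callers-2hop-graphbenchpolicy": (12397, 5444, 8532),
    "impl-divergence-trait-method": (16776, 8081, 12646),
    "macro-expanded-callers": (11278, 6605, 4368),
    "implementors-benchsink-with-blanket": (11469, 7370, 8705),
    "references-vs-callers-tool-registry-register": (32141, 32146, 27691),
    "function-as-value-vs-direct-call": (14162, 16761, 13210),
    "generic-dispatch-concrete-impl": (11132, 15741, 20293),
    "module-surface-orbit-mcp": (5886, 11134, 21885),
    "const-value-extraction": (8680, 28222, 67515),
    "construct-vs-match-benchevent-distinct": (2818, 15257, 8429),
    "reverse-export-orbit-error": (13912, 77792, 122948),
}

# A fixture counts as a graph-only win when its median is below no-graph.
pre_wins = [f for f, (ng, pre, _) in codex_medians.items() if pre < ng]
post_wins = [f for f, (ng, _, post) in codex_medians.items() if post < ng]

print(len(pre_wins), len(post_wins))  # 5 7
print(sorted(set(post_wins) - set(pre_wins)))
# ['function-as-value-vs-direct-call', 'references-vs-callers-tool-registry-register']
```

The two fixtures that moved into the win column are both cells where pre-fix graph-only sat close to parity with no-graph.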


Per-tool response size is measured in response characters, not model tokens. The transcripts do not expose per-tool token attribution.

toolinvocationssucceededfailedsuccess_ratemedian_response_charsp90_response_chars
callers74357%183829660
refs1410471%394518990
implementors660100%11221249
deps660100%606606
pack21150%979979
search80800100%2121427
show3243231100%7622931
overview000n/a--

Failed graph calls by message:

| message | count | affected tools |
| --- | --- | --- |
| `invalid input: include entries must be code, doc, config, or all, got ["code"]` | 3 | refs |
| `execution failed: selector BenchDerivedStruct::default:fn does not resolve to a node` | 2 | callers |
| `execution failed: selector BenchDerivedStruct::default does not resolve to a node` | 1 | callers |
| `invalid input: invalid selector ["file:crates/orbit-mcp/src/lib.rs"]` | 1 | pack |
| `invalid input: selector file:crates/orbit-common/src/types.rs does not resolve to a node` | 1 | show |
| `invalid input: include entries must be code, doc, config, or all, got ["code", "config"]` | 1 | refs |

These failures are not the same shape as the pre-fix Codex scalar-list failures. The remaining Claude failures are nested-list/invalid-selector mistakes and a derive/default selector expectation that graph does not support.

Per-tool response size is measured in response characters, not model tokens. The transcripts do not expose per-tool token attribution.

toolinvocationssucceededfailedsuccess_ratemedian_response_charsp90_response_chars
callers16160100%1838382947
refs61392264%49490616
implementors1312192%11221249
deps660100%606606
pack6965494%172419595
search103102199%2602653
show84840100%7771587
overview22220100%243659717

Failed graph calls by message:

| message | count | affected tools |
| --- | --- | --- |
| `invalid input: include must be an array of strings` | 22 | refs |
| `invalid input: selectors must be an array` | 4 | pack |
| `invalid input: query must not be empty` | 1 | search |
| `invalid input: invalid selector BenchAuditSink` | 1 | implementors |

The refs.include and pack.selectors scalar-list failures were addressed by T20260425-0729 before the post-fix rerun.

toolinvocationssucceededfailedsuccess_ratemedian_response_charsp90_response_chars
show1311310100%21814018
pack1021020100%626153902
search10199298%61714112
refs4543295%1941116359
overview24240100%3626128204
callers23230100%3967319274
implementors990100%21812689
deps330100%13961396

Failed graph calls by message:

| message | count | affected tools |
| --- | --- | --- |
| `invalid input: query must not be empty` | 2 | search |
| `invalid input: invalid selector ToolRegistry::register: selectors must start with dir:, file:, or symbol:` | 1 | refs |
| `invalid input: invalid selector OrbitError: selectors must start with dir:, file:, or symbol:` | 1 | refs |

The schema-coercion class (scalar-as-array) is fully resolved — 26 of the pre-fix 28 failures are gone. Remaining failures are query/selector ergonomics: empty query strings, and missing symbol:/file:/dir: prefixes on selectors. Both classes are recoverable by the agent on retry, but the latter is the same shape Claude hits post-fix and is a candidate for the next ergonomics task.

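Tool-mix shift is also notable: post-fix Codex now leans on show (131 invocations vs 84 pre-fix) and pack (102 vs 69), which are direct file/symbol inspection tools, instead of refs (45 vs 61), where the schema-coercion friction lived. One caveat on scope: the pre-fix per-tool invocations sum to 374, which matches pre-fix graph-only plus hybrid graph calls (334 + 40), while the post-fix invocations sum to 438 (graph-only alone), so the pre-fix counts appear to include hybrid calls.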


Non-passing runs:

| provider | arm | fixture | seeds | classification | observed answer |
| --- | --- | --- | --- | --- | --- |
| claude | no-graph | const-value-extraction | 1, 2 | source-search miss | omitted V2_TOOL_WILDCARD_ROOTS; seed 3 found the full set |
| codex | graph-only (pre-fix) | module-surface-orbit-mcp | 1, 2, 3 | known graph bug / root-surface gap (T20260425-0739) | returned McpHost, serve_stdio; excluded OrbitToolServer. Resolved post-fix: 3/3 pass. |
| codex | graph-only (pre-fix) | reverse-export-orbit-error | 1, 2, 3 | known graph bug / re-export metadata gap (T20260425-0739) | returned []; excluded the original definition. Resolved post-fix: 3/3 pass. |

Anomaly flags are not mutually exclusive. Primary means the row count emitted by aggregate.py’s precedence-ordered taxonomy; independent means the flag was true even if another flag won precedence.

| provider | flag | runs | notes |
| --- | --- | --- | --- |
| claude | schema-coercion | 8 primary | 9 failed graph calls total; all recovered. Remaining shapes are nested-list/invalid-selector errors, not the pre-fix scalar-list issue. |
| claude | payload-firehose | 7 primary / 13 independent | Concentrated in graph-only const-value-extraction, macro-expanded-callers, reverse-export-orbit-error, implementors-benchsink-with-blanket, and one generic-dispatch-concrete-impl seed. |
| claude | wrong-tool | 0 | No graph-only Claude run failed. |
| claude | design-defect | 21 | Hybrid passed with zero graph calls. Interpret as “organic selection avoided graph”, not as a correctness failure. |
| codex (pre-fix) | schema-coercion | 25 primary | 28 failed graph calls total; most were recovered by retrying with array-shaped args. |
| codex (pre-fix) | payload-firehose | 2 primary / 6 independent | Primary taxonomy hides several firehose runs behind schema-coercion. |
| codex (pre-fix) | wrong-tool | 6 | The six pre-fix graph-only non-passing runs above. |
| codex (pre-fix) | design-defect | 13 | Hybrid passed with zero graph calls. Interpret as “organic selection avoided graph”, not as a correctness failure. |
| codex (post-fix) | schema-coercion | 4 primary | All 4 are query/selector ergonomics (empty query, missing dir:/file:/symbol: prefix), not the pre-fix scalar-list class. |
| codex (post-fix) | payload-firehose | 4 primary | Concentrated in const-value-extraction and reverse-export-orbit-error, fixtures where graph enumerates many candidates and Codex pages through them. |
| codex (post-fix) | wrong-tool | 0 | All 36 graph-only cells passed. |
| codex (post-fix) | design-defect | 13 | Same hybrid runs as pre-fix; hybrid was not rerun. |
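
The primary versus independent distinction can be pictured as a precedence-ordered scan over per-run flags. The sketch below is a guess at the shape of that logic, not a copy of aggregate.py: the flag names come from the anomaly table, the only ordering the notes imply is that schema-coercion outranks payload-firehose, and the rest of `PRECEDENCE` is assumed.

```python
from typing import Dict, List, Optional, Tuple

# Assumed precedence order. Only "schema-coercion before payload-firehose"
# is implied by the notes; the remaining order is a guess.
PRECEDENCE = ["schema-coercion", "payload-firehose", "wrong-tool", "design-defect"]


def classify(flags: Dict[str, bool]) -> Tuple[Optional[str], List[str]]:
    """Return (primary flag, all independently true flags) for one run."""
    independent = [name for name in PRECEDENCE if flags.get(name, False)]
    primary = independent[0] if independent else None
    return primary, independent


# Hypothetical run that both retried a malformed call and pulled a huge payload.
primary, independent = classify({"schema-coercion": True, "payload-firehose": True})
print(primary)      # schema-coercion
print(independent)  # ['schema-coercion', 'payload-firehose']
```

A run like this counts once in the schema-coercion primary column and once in the payload-firehose independent column, which is how a firehose run can be hidden behind schema-coercion in the primary counts.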

Top graph-only wins by token reduction:

| provider | fixture | result |
| --- | --- | --- |
| codex (pre-fix) | deps-downstream-orbit-knowledge | 3/3 pass, 0.14x no-graph tokens |
| codex (post-fix) | deps-downstream-orbit-knowledge | 3/3 pass, 0.17x no-graph tokens |
| codex (post-fix) | macro-expanded-callers | 3/3 pass, 0.39x no-graph tokens |
| codex (pre-fix) | callers-2hop-graphbenchpolicy | 3/3 pass, 0.44x no-graph tokens |
| codex (pre-fix) | impl-divergence-trait-method | 3/3 pass, 0.48x no-graph tokens |
| claude | deps-downstream-orbit-knowledge | 3/3 pass, 0.83x no-graph tokens |
| claude | references-vs-callers-tool-registry-register | 3/3 pass, 0.89x no-graph tokens |

Accuracy standouts:

| provider | fixture | result |
| --- | --- | --- |
| claude + codex (post-fix) | reverse-export-orbit-error | both providers graph-only 3/3 post-fix; Codex pre-fix was 0/3 |
| claude + codex (post-fix) | module-surface-orbit-mcp | both providers graph-only 3/3 post-fix; Codex pre-fix was 0/3 |
| claude | const-value-extraction | graph-only 3/3 while no-graph was 1/3 |

Top graph-only losses:

| provider | fixture | result |
| --- | --- | --- |
| codex (post-fix) | reverse-export-orbit-error | 3/3 pass, 8.84x no-graph tokens (pre-fix was 0/3 at 5.59x) |
| codex (post-fix) | const-value-extraction | 3/3 pass, 7.78x no-graph tokens |
| claude | const-value-extraction | 3/3 pass, 9.71x no-graph tokens |
| claude | macro-expanded-callers | 3/3 pass, 7.61x no-graph tokens |
| claude | reverse-export-orbit-error | 3/3 pass post-fix, 6.73x no-graph tokens |
| codex (post-fix) | module-surface-orbit-mcp | 3/3 pass, 3.72x no-graph tokens (pre-fix was 0/3 at 1.89x) |

The full v4 result — now with both providers post-fix on graph-only — supports keeping graph as an optional navigation surface, not as a replacement for source reads. Graph is excellent when the question maps directly to a precise graph primitive, with deps-downstream-orbit-knowledge the cleanest repeated win across both providers (0.17x for Codex post-fix, 0.83x for Claude).

T20260425-0739 was the right diagnosis. The post-fix Codex rerun confirms it directly: both reverse-export-orbit-error and module-surface-orbit-mcp flipped from 0/3 to 3/3 with only the harness SHA and graph index changed. With Codex now post-fix, the Codex-vs-Claude comparison is clean, and the residual cost difference (Codex graph-only median 12,928 vs Claude graph-only median 2,330) is the load-bearing provider-behavior signal, not a tool-bug artifact.

The post-fix data also exposes a subtler pattern: fixing the bug increased cost on the formerly-failing fixtures. Pre-fix Codex bailed early on reverse-export-orbit-error (~78k tokens to fail); post-fix it does the real work and pays ~123k tokens to succeed. The pre-fix cost ratios on those two fixtures were under-estimates of “what graph-only actually costs to answer this question.” Post-fix cost ratios are the honest reading.

Hybrid is still the practical success case, but its meaning differs by provider. Codex selectively used graph and got the strongest overall cost/correctness profile on the hybrid subset. Claude mostly avoided graph in hybrid, so its 24/24 result is better read as “source fallback remains essential” than “graph was selected well.” The post-fix rerun does not change this: Codex hybrid was not rerun, so its 24/24 pass rate and 11/24 graph-call rate are pre-fix numbers that the graph-only rerun does not contradict.

The highest-leverage next steps are:

  1. Add payload shaping for high-cardinality responses (refs, overview, callers, and repeated show) so graph-only cannot spend 6x-10x tokens on enumeration. The post-fix reverse-export (8.84x) and const-value-extraction (7.78x) cells are the load-bearing examples.
  2. Tighten selector ergonomics: post-fix Codex still hit 4 failed graph calls — 2 empty queries and 2 unprefixed selectors (OrbitError, ToolRegistry::register instead of symbol:OrbitError). Claude hits the same shape post-fix. Both providers want a forgiving selector parser.
  3. Add a small hybrid-selection round with explicit “prefer graph when it directly answers the relationship; fall back to source for bodies/values” guidance. Neutral hybrid prompts measure organic tool choice, and Claude’s organic choice was mostly “do not use graph.”
  4. Optionally rerun Codex hybrid post-fix to close out the tool-fix confound entirely. Expected delta: 3 failed graph calls → 0; pass rate stays at 24/24; median tokens drop slightly. Skipped here on cost grounds.
  5. Keep fixture-level ratios as the main cost metric. Aggregate medians hide both the deps win and the expensive-but-correct post-fix reverse-export result.
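
To make the selector-ergonomics item concrete, here is a minimal sketch of what a forgiving selector normalizer could look like. It is hypothetical: the function names are invented, the graph tool's real parser is not shown in this document, and the path-versus-symbol heuristic is an assumption. The only facts taken from the results are the three accepted prefixes (dir:, file:, symbol:) and the failure shapes observed (unprefixed selectors, nested lists, scalars where arrays were expected).

```python
SELECTOR_PREFIXES = ("dir:", "file:", "symbol:")


def _flatten(value):
    """Yield selector strings from a scalar, a list, or a nested list."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        raise TypeError(f"unsupported selector value: {value!r}")


def normalize_selector(raw: str) -> str:
    """Add a prefix when one is missing; reject empty selectors."""
    text = raw.strip()
    if not text:
        raise ValueError("selector must not be empty")
    if text.startswith(SELECTOR_PREFIXES):
        return text
    # Heuristic (assumption): path-like strings are dirs or files,
    # everything else is treated as a symbol.
    if text.endswith("/"):
        return "dir:" + text
    if "/" in text:
        return "file:" + text
    return "symbol:" + text


def normalize_selectors(value) -> list:
    """Coerce scalar or nested input into a flat list of prefixed selectors."""
    return [normalize_selector(s) for s in _flatten(value)]


print(normalize_selectors("OrbitError"))
# ['symbol:OrbitError']
print(normalize_selectors([["file:crates/orbit-mcp/src/lib.rs"]]))
# ['file:crates/orbit-mcp/src/lib.rs']
```

Whether guessing a prefix is safer than returning a clearer error is a design decision for the graph tool; the sketch only shows that the observed failure shapes are mechanically recoverable.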

Aggregate tables:

```bash
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v4/runs \
  --tasks benchmarks/graph/v4/tasks
```

Completed Codex sweeps:

The original no-graph and graph-only Codex artifacts are pre-fix for T20260425-0729 and T20260425-0739. Codex graph-only was rerun post-fix on 2026-04-25; the pre-fix graph-only artifacts now live at _archive/codex-graph-only-pre-fix-T20260425-0739/.

```bash
# Pre-fix Codex no-graph + graph-only (graph-only artifacts later archived)
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider codex --arms no-graph graph-only --n 3
```

```bash
# Pre-fix Codex hybrid (not rerun post-fix; passed 24/24)
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider codex --arms hybrid --n 3 \
  --tasks callers-2hop-graphbenchpolicy construct-vs-match-benchevent-distinct \
  deps-downstream-orbit-knowledge function-as-value-vs-direct-call \
  generic-dispatch-concrete-impl implementors-benchsink-with-blanket \
  macro-expanded-callers reverse-export-orbit-error
```

```bash
# Post-fix Codex graph-only rerun (after archiving pre-fix dir)
mv benchmarks/graph/v4/runs/codex/graph-only \
  benchmarks/graph/v4/_archive/codex-graph-only-pre-fix-T20260425-0739
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider codex --arms graph-only --n 3
```

To re-aggregate the pre-fix data, temporarily symlink the archived dir back into runs/codex/graph-only (or pass a different --runs root pointing at _archive/).

Completed Claude sweeps:

```bash
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider claude --arms no-graph graph-only --n 3
```

```bash
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
  --provider claude --arms hybrid --n 3 \
  --tasks callers-2hop-graphbenchpolicy construct-vs-match-benchevent-distinct \
  deps-downstream-orbit-knowledge function-as-value-vs-direct-call \
  generic-dispatch-concrete-impl implementors-benchsink-with-blanket \
  macro-expanded-callers reverse-export-orbit-error
```