Graph Token-Usage Benchmark - v4 Results
Mirrored from benchmarks/graph/v4/RESULTS.md. Edit the source document in the repository, not this generated page.
Status: Complete for Codex and Claude. Codex graph-only was rerun post-fix on 2026-04-25 to remove the tool-fix confound.
Sweep dates: 2026-04-24 to 2026-04-25
Codex sweep IDs: 20260424-230632-f36f84 (pre-fix no-graph, graph-only), 20260425-001959-842b8c (pre-fix hybrid), 20260425-115117-21a938 (post-fix graph-only)
Claude sweep IDs: 20260425-013339-cfac6a (no-graph, graph-only), 20260425-012511-8881cb (hybrid)
Codex sweep seeds: 928176111 (pre-fix no-graph, graph-only), 583229300 (pre-fix hybrid), 142643867 (post-fix graph-only)
Claude sweep seeds: 152346771 (no-graph, graph-only), 88160319 (hybrid)
Harness SHAs: Codex pre-fix b0ce189e7053409c8754865bd154cd20e1de66a6; Claude post-fix run / Codex post-fix graph-only rerun 56a9c07b64479360f9a64ca94b40721f76226014 (post-fix branch contains both T20260425-0729 and T20260425-0739).
Scope: 228 cells total. Each provider ran no-graph on 12 fixtures x 3 seeds, graph-only on 12 fixtures x 3 seeds (Codex ran graph-only twice, pre-fix and post-fix), and hybrid on 8 fixtures x 3 seeds. No errored cells.
Comparison caveat: Codex no-graph and hybrid are pre-fix; Codex graph-only was rerun post-fix and is the canonical Codex-vs-Claude comparison row. Pre-fix Codex graph-only artifacts are preserved at benchmarks/graph/v4/_archive/codex-graph-only-pre-fix-T20260425-0739/ and remain available as a “what the bug looked like” reference. Codex no-graph was not rerun (unaffected by either fix). Codex hybrid was not rerun (passed 24/24 pre-fix; the only delta would be 3 failed graph calls becoming 0).
Headline
- Codex pre-fix: no-graph passed 36/36, graph-only passed 30/36, and hybrid passed 24/24. All six Codex graph-only failures came from module-surface-orbit-mcp and reverse-export-orbit-error, the two fixtures most exposed to the T20260425-0739 re-export bug.
- Codex post-fix (graph-only rerun): graph-only passed 36/36 (was 30/36), median 12,928 tokens (was 15,462), and failed graph calls dropped to 4 (was 25). The two formerly-failing fixtures both flipped to 3/3, confirming T20260425-0739 was the right diagnosis. Schema-coercion churn was eliminated by T20260425-0729.
- Claude post-fix: graph-only passed 36/36, hybrid passed 24/24, and no-graph passed 34/36. The two Claude no-graph failures were both const-value-extraction runs that omitted V2_TOOL_WILDCARD_ROOTS.
- The fix is symmetric across providers: post-fix, both Codex and Claude graph-only pass reverse-export-orbit-error and module-surface-orbit-mcp 3/3. The Codex-vs-Claude comparison is now clean, with no tool-bug confound.
- Graph-only accuracy improved post-fix, but cost shifted unevenly: the two formerly-failing fixtures became expensive passes (Codex reverse-export-orbit-error 122,948 tokens vs 13,912 no-graph; module-surface-orbit-mcp 21,885 vs 5,886). Four other fixtures got cheaper because schema-coercion retries are gone. Net: the median dropped while p90 rose from 64,877 to 71,774.
- Hybrid remains the practical operating mode, but providers route differently: Codex hybrid used graph in 11/24 runs and passed all seeds. Claude hybrid used graph in only 3/24 runs, all on deps-downstream-orbit-knowledge, and passed the rest via shell/source fallback.
- The remaining graph work is payload shaping and selector ergonomics: post-fix Codex still hit 4 failed graph calls (2 empty-query searches, 2 unprefixed selectors). Claude post-fix hit 9 (nested-list/invalid-selector shapes). Both classes are recoverable but waste tokens on retry cycles.
Arm Summary
Token totals are input_tokens + output_tokens, matching the aggregator’s marginal-token convention. Cached read tokens and Claude USD cost are reported separately by the raw records, but not included in the median-token columns.
| provider | arm | runs | pass | median_total_tokens | p90_total_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell_or_fs_calls |
|---|---|---|---|---|---|---|---|---|---|
| claude | no-graph | 36 | 34/36 | 713 | 3159 | 0/36 | 0 | 0 | 154 |
| claude | graph-only | 36 | 36/36 | 2330 | 6866 | 36/36 | 436 | 9 | 0 |
| claude | hybrid | 24 | 24/24 | 663 | 2449 | 3/24 | 3 | 0 | 45 |
| codex | no-graph | 36 | 36/36 | 11446 | 27792 | 0/36 | 0 | 0 | 197 |
| codex | graph-only (pre-fix) | 36 | 30/36 | 15462 | 64877 | 36/36 | 334 | 25 | 0 |
| codex | graph-only (post-fix) | 36 | 36/36 | 12928 | 71774 | 36/36 | 438 | 4 | 0 |
| codex | hybrid | 24 | 24/24 | 3900 | 11048 | 11/24 | 40 | 3 | 61 |
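The marginal-token convention behind the median columns can be sketched as follows. This is a minimal illustration, not the aggregator's code: the record layout, the `cached_read_tokens` field name, and the numbers are assumptions chosen for the example; only `input_tokens` and `output_tokens` are named in this report.

```python
from statistics import median

def marginal_tokens(record):
    """Total used for the median columns: input + output, cached reads excluded."""
    return record["input_tokens"] + record["output_tokens"]

# Hypothetical run records for one cell (three seeds); values are illustrative only.
runs = [
    {"input_tokens": 400, "output_tokens": 135, "cached_read_tokens": 9000},
    {"input_tokens": 380, "output_tokens": 120, "cached_read_tokens": 8700},
    {"input_tokens": 600, "output_tokens": 146, "cached_read_tokens": 9100},
]

totals = [marginal_tokens(r) for r in runs]
cell_median = median(totals)
print(totals, cell_median)
```

Because cached reads are left out, a run with a very large cached context can still show a small total in these tables.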
Hybrid only ran on the 8 graph-strength and precision-gap fixtures, per METHOD.md. On that same 24-run subset:
| provider | no-graph median | graph-only median | hybrid median |
|---|---|---|---|
| claude | 491 | 1541 | 663 |
| codex | 11446 | 15114 | 3900 |
Primary Aggregate
Verbatim from:

    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/aggregate.py \
      --runs benchmarks/graph/v4/runs \
      --tasks benchmarks/graph/v4/tasks

| provider | arm | task_class | runs | pass_rate | median_total_tokens | p90_total_tokens | tokens_per_success | graph_calls | graph_call_rate | shell_or_fs_calls |
|---|---|---|---|---|---|---|---|---|---|---|
| claude | graph-only | graph-strength | 12 | 100% | 1310 | 3944 | 1871 | 77 | 12/12 = 100.0% | 0 |
| claude | graph-only | payload-volume | 6 | 100% | 5128 | 11584 | 5636 | 201 | 6/6 = 100.0% | 0 |
| claude | graph-only | precision-gap | 12 | 100% | 1854 | 4538 | 2252 | 90 | 12/12 = 100.0% | 0 |
| claude | graph-only | selector-ambiguity | 6 | 100% | 3960 | 7645 | 4308 | 68 | 6/6 = 100.0% | 0 |
| claude | hybrid | graph-strength | 12 | 100% | 669 | 1842 | 818 | 3 | 3/12 = 25.0% | 12 |
| claude | hybrid | precision-gap | 12 | 100% | 468 | 2887 | 935 | 0 | 0/12 = 0.0% | 33 |
| claude | no-graph | graph-strength | 12 | 100% | 536 | 1841 | 737 | 0 | 0/12 = 0.0% | 31 |
| claude | no-graph | payload-volume | 6 | 67% | 1517 | 2508 | 2207 | 0 | 0/6 = 0.0% | 34 |
| claude | no-graph | precision-gap | 12 | 100% | 343 | 3052 | 874 | 0 | 0/12 = 0.0% | 30 |
| claude | no-graph | selector-ambiguity | 6 | 100% | 2677 | 5296 | 3188 | 0 | 0/6 = 0.0% | 59 |
| codex | graph-only (pre-fix) | graph-strength | 12 | 75% | 9352 | 78425 | 33907 | 115 | 12/12 = 100.0% | 0 |
| codex | graph-only (pre-fix) | payload-volume | 6 | 100% | 16855 | 101587 | 29067 | 57 | 6/6 = 100.0% | 0 |
| codex | graph-only (pre-fix) | precision-gap | 12 | 100% | 15450 | 24307 | 14631 | 91 | 12/12 = 100.0% | 0 |
| codex | graph-only (pre-fix) | selector-ambiguity | 6 | 50% | 21961 | 37822 | 44688 | 71 | 6/6 = 100.0% | 0 |
| codex | graph-only (post-fix) | graph-strength | 12 | 100% | 8822 | 139043 | 36989 | 153 | 12/12 = 100.0% | 0 |
| codex | graph-only (post-fix) | payload-volume | 6 | 100% | 41726 | 76033 | 41984 | 106 | 6/6 = 100.0% | 0 |
| codex | graph-only (post-fix) | precision-gap | 12 | 100% | 8813 | 35578 | 12930 | 107 | 12/12 = 100.0% | 0 |
| codex | graph-only (post-fix) | selector-ambiguity | 6 | 100% | 26106 | 32152 | 24042 | 72 | 6/6 = 100.0% | 0 |
| codex | hybrid | graph-strength | 12 | 100% | 4232 | 14417 | 5394 | 26 | 9/12 = 75.0% | 23 |
| codex | hybrid | precision-gap | 12 | 100% | 3346 | 7607 | 3871 | 14 | 2/12 = 16.7% | 38 |
| codex | no-graph | graph-strength | 12 | 100% | 13154 | 26230 | 14676 | 0 | 0/12 = 0.0% | 67 |
| codex | no-graph | payload-volume | 6 | 100% | 12434 | 24527 | 13374 | 0 | 0/6 = 0.0% | 36 |
| codex | no-graph | precision-gap | 12 | 100% | 11169 | 15969 | 8733 | 0 | 0/12 = 0.0% | 42 |
| codex | no-graph | selector-ambiguity | 6 | 100% | 19298 | 51473 | 21861 | 0 | 0/6 = 0.0% | 52 |
Category Aggregate
go/ng is the graph-only to no-graph token ratio. median go/ng compares aggregate category medians. mean fixture go/ng and worst fixture go/ng are per-fixture ratios and are the load-bearing cost readings. Lower ratios are better for graph-only.
| provider | category | no-graph pass | graph-only pass | pass delta | median go/ng | mean fixture go/ng | worst fixture go/ng | hybrid graph rate |
|---|---|---|---|---|---|---|---|---|
| claude | graph-strength | 12/12 | 12/12 | +0 | 2.44x | 3.72x | 6.73x | 3/12 |
| claude | precision-gap | 12/12 | 12/12 | +0 | 5.41x | 4.65x | 7.61x | 0/12 |
| claude | payload-volume | 4/6 | 6/6 | +2 | 3.38x | 5.68x | 9.71x | n/a |
| claude | selector-ambiguity | 6/6 | 6/6 | +0 | 1.48x | 1.42x | 1.94x | n/a |
| codex (pre-fix) | graph-strength | 12/12 | 9/12 | -3 | 0.71x | 1.70x | 5.59x | 9/12 |
| codex (pre-fix) | precision-gap | 12/12 | 12/12 | +0 | 1.38x | 2.15x | 5.41x | 2/12 |
| codex (pre-fix) | payload-volume | 6/6 | 6/6 | +0 | 1.36x | 1.87x | 3.25x | n/a |
| codex (pre-fix) | selector-ambiguity | 6/6 | 3/6 | -3 | 1.14x | 1.45x | 1.89x | n/a |
| codex (post-fix) | graph-strength | 12/12 | 12/12 | +0 | 0.67x | 2.61x | 8.84x | 9/12 |
| codex (post-fix) | precision-gap | 12/12 | 12/12 | +0 | 0.79x | 1.53x | 2.99x | 2/12 |
| codex (post-fix) | payload-volume | 6/6 | 6/6 | +0 | 3.36x | 4.27x | 7.78x | n/a |
| codex (post-fix) | selector-ambiguity | 6/6 | 6/6 | +0 | 1.35x | 2.29x | 3.72x | n/a |
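The three go/ng readings can be reproduced from the per-fixture medians. The sketch below uses the Claude graph-strength numbers from this report; the function and variable names are illustrative and are not taken from the aggregator.

```python
from statistics import mean

# (no-graph median, graph-only median) per fixture, copied from the
# Claude graph-strength rows of this report.
claude_graph_strength = {
    "deps-downstream-orbit-knowledge": (1610, 1338),
    "callers-2hop-graphbenchpolicy": (535, 1016),
    "implementors-benchsink-with-blanket": (184, 998),
    "reverse-export-orbit-error": (537, 3613),
}

def fixture_ratios(cells):
    """Per-fixture graph-only / no-graph median-token ratio."""
    return {name: go / ng for name, (ng, go) in cells.items()}

ratios = fixture_ratios(claude_graph_strength)
mean_ratio = mean(list(ratios.values()))
worst_ratio = max(ratios.values())

# Category-level medians (no-graph 536, graph-only 1310) from the primary aggregate.
median_ratio = 1310 / 536

print(f"median go/ng = {median_ratio:.2f}x")   # 2.44x
print(f"mean fixture go/ng = {mean_ratio:.2f}x")   # 3.72x
print(f"worst fixture go/ng = {worst_ratio:.2f}x")  # 6.73x
```

The category median ratio (2.44x) sits well below the per-fixture mean (3.72x) because one cheap fixture pulls the pooled median down, which is why the per-fixture readings are treated as the load-bearing ones.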
Production vs Synthetic
| provider | mode | arm | runs | pass | median_total_tokens | graph_call_rate | graph_calls | failed_graph_calls |
|---|---|---|---|---|---|---|---|---|
| claude | production | no-graph | 21 | 19/21 | 1916 | 0/21 | 0 | 0 |
| claude | production | graph-only | 21 | 21/21 | 3720 | 21/21 | 357 | 6 |
| claude | production | hybrid | 9 | 9/9 | 1735 | 3/9 | 3 | 0 |
| claude | synthetic | no-graph | 15 | 15/15 | 219 | 0/15 | 0 | 0 |
| claude | synthetic | graph-only | 15 | 15/15 | 1283 | 15/15 | 79 | 3 |
| claude | synthetic | hybrid | 15 | 15/15 | 318 | 0/15 | 0 | 0 |
| codex | production | no-graph | 21 | 21/21 | 14162 | 0/21 | 0 | 0 |
| codex | production | graph-only (pre-fix) | 21 | 15/21 | 17847 | 21/21 | 222 | 18 |
| codex | production | graph-only (post-fix) | 21 | 21/21 | 24522 | 21/21 | 295 | 4 |
| codex | production | hybrid | 9 | 9/9 | 4872 | 3/9 | 3 | 0 |
| codex | synthetic | no-graph | 15 | 15/15 | 11278 | 0/15 | 0 | 0 |
| codex | synthetic | graph-only (pre-fix) | 15 | 15/15 | 14972 | 15/15 | 112 | 7 |
| codex | synthetic | graph-only (post-fix) | 15 | 15/15 | 8429 | 15/15 | 143 | 0 |
| codex | synthetic | hybrid | 15 | 15/15 | 3819 | 8/15 | 37 | 3 |
The production split is the load-bearing product signal. Pre-fix, Codex graph-only lost accuracy only on production-grounded fixtures. Post-fix, Codex graph-only passes all 36 cells but at higher production-side cost (24,522 median vs 14,162 no-graph) — the cost moved into the two formerly-failing fixtures, which now pass but at 8.84× and 3.72× their no-graph medians. Claude graph-only passed all production fixtures post-fix; Claude no-graph missed const-value-extraction twice.
Codex Pre-Fix vs Post-Fix Comparison
The post-fix Codex graph-only rerun isolates the joint effect of T20260425-0729 (string-list coercion) and T20260425-0739 (pub-use re-export indexing). Identical fixture set, identical n=3 seeds-per-cell, identical provider/model (gpt-5.3-codex); only the harness SHA and graph index differ.
Aggregate
| metric | pre-fix | post-fix | delta |
|---|---|---|---|
| graph-only pass rate | 30/36 | 36/36 | +6 |
| graph-only median tokens | 15,462 | 12,928 | -16% |
| graph-only p90 tokens | 64,877 | 71,774 | +11% |
| total graph calls | 334 | 438 | +31% |
| failed graph calls | 25 | 4 | -84% |
| schema-coercion failures | 26 | 0 | -100% |
The schema-coercion class (refs.include must be array, pack.selectors must be array) is fully resolved. Total graph calls went up because Codex now successfully completes calls that previously failed and forced retries; more useful tool calls produce more downstream calls. p90 went up because the two formerly-failing fixtures now pass at high cost rather than giving up early.
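The percentage deltas in the aggregate table follow from a plain relative change, rounded to whole percent. A short check using the table's own numbers (the helper name is illustrative):

```python
def pct_change(pre, post):
    """Relative change from pre to post, rounded to a whole percent."""
    return round((post - pre) / pre * 100)

# (pre-fix, post-fix) values copied from the aggregate table.
aggregate = {
    "median tokens": (15462, 12928),
    "p90 tokens": (64877, 71774),
    "total graph calls": (334, 438),
    "failed graph calls": (25, 4),
}

deltas = {metric: pct_change(pre, post) for metric, (pre, post) in aggregate.items()}
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+d}%")
```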
Per-fixture flips and regressions
| fixture | class | pre-fix pass | post-fix pass | pre-fix median | post-fix median | post/pre ratio |
|---|---|---|---|---|---|---|
| reverse-export-orbit-error | graph-strength | 0/3 | 3/3 | 77,792 | 122,948 | 1.58x |
| module-surface-orbit-mcp | selector-ambiguity | 0/3 | 3/3 | 11,134 | 21,885 | 1.97x |
| const-value-extraction | payload-volume | 3/3 | 3/3 | 28,222 | 67,515 | 2.39x |
| generic-dispatch-concrete-impl | precision-gap | 3/3 | 3/3 | 15,741 | 20,293 | 1.29x |
| impl-divergence-trait-method | payload-volume | 3/3 | 3/3 | 8,081 | 12,646 | 1.56x |
| callers-2hop-graphbenchpolicy | graph-strength | 3/3 | 3/3 | 5,444 | 8,532 | 1.57x |
| deps-downstream-orbit-knowledge | graph-strength | 3/3 | 3/3 | 2,250 | 2,833 | 1.26x |
| implementors-benchsink-with-blanket | graph-strength | 3/3 | 3/3 | 7,370 | 8,705 | 1.18x |
| references-vs-callers-tool-registry-register | selector-ambiguity | 3/3 | 3/3 | 32,146 | 27,691 | 0.86x |
| function-as-value-vs-direct-call | precision-gap | 3/3 | 3/3 | 16,761 | 13,210 | 0.79x |
| macro-expanded-callers | precision-gap | 3/3 | 3/3 | 6,605 | 4,368 | 0.66x |
| construct-vs-match-benchevent-distinct | precision-gap | 3/3 | 3/3 | 15,257 | 8,429 | 0.55x |
The two formerly-failing fixtures pass at 1.58–1.97× their pre-fix cost — pre-fix Codex was bailing early into [] once graph confidently lied, so the pre-fix token count was an under-estimate of “what it actually takes to answer this question with the graph.” Post-fix, Codex does the real work and we see the real cost.
Four fixtures got cheaper post-fix (function-as-value, macro-expanded, construct-vs-match, references-vs-callers); these are cells where the schema-coercion friction was the dominant pre-fix overhead. Of the fixtures that passed in both runs, six got more expensive, five of them by only 1.18× to 1.57×. const-value-extraction is the largest “passed-then-passed-more-expensively” gap (2.39×) and probably reflects a richer post-fix index returning more candidates that Codex enumerates through.
Claude Per-Fixture Table
| fixture | class | mode | arm | pass | median_tokens | p90_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell/fs_calls |
|---|---|---|---|---|---|---|---|---|---|---|
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | no-graph | 3/3 | 535 | 746 | 0/3 | 0 | 0 | 3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only | 3/3 | 1016 | 1283 | 3/3 | 6 | 0 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | hybrid | 3/3 | 750 | 855 | 0/3 | 0 | 0 | 3 |
| const-value-extraction | payload-volume | production | no-graph | 1/3 | 719 | 1198 | 0/3 | 0 | 0 | 14 |
| const-value-extraction | payload-volume | production | graph-only | 3/3 | 6979 | 11584 | 3/3 | 158 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | no-graph | 3/3 | 481 | 699 | 0/3 | 0 | 0 | 3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only | 3/3 | 2290 | 2370 | 3/3 | 30 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | hybrid | 3/3 | 719 | 732 | 0/3 | 0 | 0 | 3 |
| deps-downstream-orbit-knowledge | graph-strength | production | no-graph | 3/3 | 1610 | 1940 | 0/3 | 0 | 0 | 19 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only | 3/3 | 1338 | 1959 | 3/3 | 3 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | hybrid | 3/3 | 1735 | 1888 | 3/3 | 3 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | no-graph | 3/3 | 2308 | 3371 | 0/3 | 0 | 0 | 21 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only | 3/3 | 4273 | 4652 | 3/3 | 24 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | hybrid | 3/3 | 2823 | 2915 | 0/3 | 0 | 0 | 24 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | no-graph | 3/3 | 197 | 219 | 0/3 | 0 | 0 | 3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only | 3/3 | 864 | 1196 | 3/3 | 13 | 0 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | hybrid | 3/3 | 219 | 239 | 0/3 | 0 | 0 | 3 |
| impl-divergence-trait-method | payload-volume | production | no-graph | 3/3 | 2004 | 2508 | 0/3 | 0 | 0 | 20 |
| impl-divergence-trait-method | payload-volume | production | graph-only | 3/3 | 3332 | 3438 | 3/3 | 43 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | no-graph | 3/3 | 184 | 184 | 0/3 | 0 | 0 | 3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only | 3/3 | 998 | 1537 | 3/3 | 7 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | hybrid | 3/3 | 318 | 327 | 0/3 | 0 | 0 | 3 |
| macro-expanded-callers | precision-gap | synthetic | no-graph | 3/3 | 203 | 231 | 0/3 | 0 | 0 | 3 |
| macro-expanded-callers | precision-gap | synthetic | graph-only | 3/3 | 1545 | 1558 | 3/3 | 23 | 3 | 0 |
| macro-expanded-callers | precision-gap | synthetic | hybrid | 3/3 | 203 | 219 | 0/3 | 0 | 0 | 3 |
| module-surface-orbit-mcp | selector-ambiguity | production | no-graph | 3/3 | 1916 | 2285 | 0/3 | 0 | 0 | 10 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only | 3/3 | 3720 | 4383 | 3/3 | 28 | 1 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | no-graph | 3/3 | 4698 | 5296 | 0/3 | 0 | 0 | 49 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only | 3/3 | 4201 | 7645 | 3/3 | 40 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | no-graph | 3/3 | 537 | 707 | 0/3 | 0 | 0 | 6 |
| reverse-export-orbit-error | graph-strength | production | graph-only | 3/3 | 3613 | 4086 | 3/3 | 61 | 4 | 0 |
| reverse-export-orbit-error | graph-strength | production | hybrid | 3/3 | 588 | 629 | 0/3 | 0 | 0 | 6 |
Codex Per-Fixture Table
The graph-only (pre-fix) rows are retained from the original Codex sweep. The graph-only (post-fix) rows are from the rerun at harness SHA 56a9c07b... after both T20260425-0729 and T20260425-0739 landed.
| fixture | class | mode | arm | pass | median_tokens | p90_tokens | graph_call_rate | graph_calls | failed_graph_calls | shell/fs_calls |
|---|---|---|---|---|---|---|---|---|---|---|
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | no-graph | 3/3 | 12397 | 23855 | 0/3 | 0 | 0 | 10 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only (pre-fix) | 3/3 | 5444 | 15280 | 3/3 | 16 | 1 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | graph-only (post-fix) | 3/3 | 8532 | 8858 | 3/3 | 28 | 0 | 0 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | hybrid | 3/3 | 5495 | 5891 | 3/3 | 14 | 0 | 1 |
| const-value-extraction | payload-volume | production | no-graph | 3/3 | 8680 | 9445 | 0/3 | 0 | 0 | 21 |
| const-value-extraction | payload-volume | production | graph-only (pre-fix) | 3/3 | 28222 | 101587 | 3/3 | 33 | 2 | 0 |
| const-value-extraction | payload-volume | production | graph-only (post-fix) | 3/3 | 67515 | 74329 | 3/3 | 61 | 2 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | no-graph | 3/3 | 2818 | 12423 | 0/3 | 0 | 0 | 8 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 15257 | 17551 | 3/3 | 20 | 1 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 8429 | 15958 | 3/3 | 22 | 0 | 0 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | hybrid | 3/3 | 2873 | 4957 | 0/3 | 0 | 0 | 6 |
| deps-downstream-orbit-knowledge | graph-strength | production | no-graph | 3/3 | 16379 | 25955 | 0/3 | 0 | 0 | 30 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only (pre-fix) | 3/3 | 2250 | 11335 | 3/3 | 6 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | graph-only (post-fix) | 3/3 | 2833 | 11605 | 3/3 | 6 | 0 | 0 |
| deps-downstream-orbit-knowledge | graph-strength | production | hybrid | 3/3 | 1383 | 1449 | 3/3 | 3 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | no-graph | 3/3 | 14162 | 16744 | 0/3 | 0 | 0 | 23 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only (pre-fix) | 3/3 | 16761 | 23051 | 3/3 | 27 | 3 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | graph-only (post-fix) | 3/3 | 13210 | 17944 | 3/3 | 26 | 0 | 0 |
| function-as-value-vs-direct-call | precision-gap | production | hybrid | 3/3 | 4874 | 8743 | 0/3 | 0 | 0 | 21 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | no-graph | 3/3 | 11132 | 11207 | 0/3 | 0 | 0 | 4 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 15741 | 24846 | 3/3 | 22 | 1 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 20293 | 37761 | 3/3 | 39 | 0 | 0 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | hybrid | 3/3 | 3819 | 4741 | 2/3 | 14 | 1 | 4 |
| impl-divergence-trait-method | payload-volume | production | no-graph | 3/3 | 16776 | 24527 | 0/3 | 0 | 0 | 15 |
| impl-divergence-trait-method | payload-volume | production | graph-only (pre-fix) | 3/3 | 8081 | 17847 | 3/3 | 24 | 1 | 0 |
| impl-divergence-trait-method | payload-volume | production | graph-only (post-fix) | 3/3 | 12646 | 22257 | 3/3 | 45 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | no-graph | 3/3 | 11469 | 22553 | 0/3 | 0 | 0 | 7 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only (pre-fix) | 3/3 | 7370 | 34916 | 3/3 | 32 | 2 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | graph-only (post-fix) | 3/3 | 8705 | 10149 | 3/3 | 34 | 0 | 0 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | hybrid | 3/3 | 3981 | 14873 | 3/3 | 9 | 2 | 7 |
| macro-expanded-callers | precision-gap | synthetic | no-graph | 3/3 | 11278 | 11512 | 0/3 | 0 | 0 | 7 |
| macro-expanded-callers | precision-gap | synthetic | graph-only (pre-fix) | 3/3 | 6605 | 14098 | 3/3 | 22 | 2 | 0 |
| macro-expanded-callers | precision-gap | synthetic | graph-only (post-fix) | 3/3 | 4368 | 4606 | 3/3 | 20 | 0 | 0 |
| macro-expanded-callers | precision-gap | synthetic | hybrid | 3/3 | 2288 | 2485 | 0/3 | 0 | 0 | 7 |
| module-surface-orbit-mcp | selector-ambiguity | production | no-graph | 3/3 | 5886 | 7437 | 0/3 | 0 | 0 | 16 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only (pre-fix) | 0/3 | 11134 | 19582 | 3/3 | 42 | 5 | 0 |
| module-surface-orbit-mcp | selector-ambiguity | production | graph-only (post-fix) | 3/3 | 21885 | 26781 | 3/3 | 41 | 0 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | no-graph | 3/3 | 32141 | 51473 | 0/3 | 0 | 0 | 36 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only (pre-fix) | 3/3 | 32146 | 37822 | 3/3 | 29 | 4 | 0 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | graph-only (post-fix) | 3/3 | 27691 | 31259 | 3/3 | 31 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | no-graph | 3/3 | 13912 | 26349 | 0/3 | 0 | 0 | 20 |
| reverse-export-orbit-error | graph-strength | production | graph-only (pre-fix) | 0/3 | 77792 | 78697 | 3/3 | 61 | 3 | 0 |
| reverse-export-orbit-error | graph-strength | production | graph-only (post-fix) | 3/3 | 122948 | 141342 | 3/3 | 85 | 1 | 0 |
| reverse-export-orbit-error | graph-strength | production | hybrid | 3/3 | 5830 | 13354 | 0/3 | 0 | 0 | 15 |
Claude Hybrid Utilization
| fixture | pass | median_tokens | graph_call_rate | graph_calls | shell/fs_calls | interpretation |
|---|---|---|---|---|---|---|
| callers-2hop-graphbenchpolicy | 3/3 | 750 | 0/3 | 0 | 3 | Passed by shell/source fallback; graph avoided organically. |
| construct-vs-match-benchevent-distinct | 3/3 | 719 | 0/3 | 0 | 3 | Passed by shell/source fallback; graph avoided organically. |
| deps-downstream-orbit-knowledge | 3/3 | 1735 | 3/3 | 3 | 0 | Only Claude hybrid fixture that used graph; direct deps solved it cleanly. |
| function-as-value-vs-direct-call | 3/3 | 2823 | 0/3 | 0 | 24 | Passed by source fallback with relatively heavy shell/read use. |
| generic-dispatch-concrete-impl | 3/3 | 219 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| implementors-benchsink-with-blanket | 3/3 | 318 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| macro-expanded-callers | 3/3 | 203 | 0/3 | 0 | 3 | Passed by direct source inspection; graph avoided organically. |
| reverse-export-orbit-error | 3/3 | 588 | 0/3 | 0 | 6 | Passed by shell/source fallback despite graph-only success post-fix. |
Claude hybrid shows that neutral hybrid prompting does not guarantee graph selection. It mostly measures whether Claude can route to the cheapest available source strategy, and for this fixture set that was usually shell/source reading rather than graph.
Codex Hybrid Utilization
| fixture | pass | median_tokens | graph_call_rate | graph_calls | shell/fs_calls | interpretation |
|---|---|---|---|---|---|---|
| callers-2hop-graphbenchpolicy | 3/3 | 5495 | 3/3 | 14 | 1 | Used graph consistently; stayed well below no-graph. |
| construct-vs-match-benchevent-distinct | 3/3 | 2873 | 0/3 | 0 | 6 | Passed by shell fallback; graph avoided organically. |
| deps-downstream-orbit-knowledge | 3/3 | 1383 | 3/3 | 3 | 0 | Best graph-shaped win; deps solved directly. |
| function-as-value-vs-direct-call | 3/3 | 4874 | 0/3 | 0 | 21 | Passed by shell fallback; graph avoided organically. |
| generic-dispatch-concrete-impl | 3/3 | 3819 | 2/3 | 14 | 4 | Mixed graph use; source reading did the final disambiguation. |
| implementors-benchsink-with-blanket | 3/3 | 3981 | 3/3 | 9 | 7 | Used graph, then shell/source checks; cheaper than both baselines. |
| macro-expanded-callers | 3/3 | 2288 | 0/3 | 0 | 7 | Passed by shell fallback; graph avoided organically. |
| reverse-export-orbit-error | 3/3 | 5830 | 0/3 | 0 | 15 | Passed by shell fallback; pre-fix graph-only failed all seeds. |
Codex hybrid’s 24/24 pass rate is not proof that every hybrid-eligible fixture is graph-shaped. It is proof that Codex can route around graph gaps when shell/source tools are available.
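The hybrid routing rates quoted in the headline (3/24 for Claude, 11/24 for Codex) are sums of the per-fixture graph_call_rate numerators in the two utilization tables. A quick tally, with the counts copied from those tables and the helper name chosen for illustration:

```python
# Seeds (out of 3) in which the hybrid arm made at least one graph call.
FIXTURES = [
    "callers-2hop-graphbenchpolicy",
    "construct-vs-match-benchevent-distinct",
    "deps-downstream-orbit-knowledge",
    "function-as-value-vs-direct-call",
    "generic-dispatch-concrete-impl",
    "implementors-benchsink-with-blanket",
    "macro-expanded-callers",
    "reverse-export-orbit-error",
]
claude_graph_runs = dict(zip(FIXTURES, [0, 0, 3, 0, 0, 0, 0, 0]))
codex_graph_runs = dict(zip(FIXTURES, [3, 0, 3, 0, 2, 3, 0, 0]))

def routing_summary(graph_runs, seeds_per_fixture=3):
    """Return (runs that used graph, total runs, fixtures that ever used graph)."""
    used = sum(graph_runs.values())
    total = seeds_per_fixture * len(graph_runs)
    fixtures_using_graph = [name for name, n in graph_runs.items() if n > 0]
    return used, total, fixtures_using_graph

claude_summary = routing_summary(claude_graph_runs)
codex_summary = routing_summary(codex_graph_runs)
print(claude_summary[:2], codex_summary[:2])
```

Claude touched graph on one fixture, Codex on four; both providers left the same three precision-gap fixtures and reverse-export-orbit-error to shell/source tools.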
Claude Graph-Only Cost Ratios
| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
|---|---|---|---|---|---|---|
| deps-downstream-orbit-knowledge | graph-strength | production | 1610 | 1338 | 0.83x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 4698 | 4201 | 0.89x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 2004 | 3332 | 1.66x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 2308 | 4273 | 1.85x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 535 | 1016 | 1.90x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 1916 | 3720 | 1.94x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 197 | 864 | 4.39x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 481 | 2290 | 4.76x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 184 | 998 | 5.42x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 537 | 3613 | 6.73x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 203 | 1545 | 7.61x | 3/3 |
| const-value-extraction | payload-volume | production | 719 | 6979 | 9.71x | 3/3 |
Claude graph-only was excellent for accuracy after the graph fixes, but it was rarely the cheapest route. The const-value-extraction cell is the sharpest tradeoff: graph-only found the full set in every seed, while no-graph missed one constant twice, but graph-only used 9.71x the no-graph median tokens.
Codex Graph-Only Cost Ratios (pre-fix)
| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
|---|---|---|---|---|---|---|
| deps-downstream-orbit-knowledge | graph-strength | production | 16379 | 2250 | 0.14x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 12397 | 5444 | 0.44x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 16776 | 8081 | 0.48x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 11278 | 6605 | 0.59x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 11469 | 7370 | 0.64x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 32141 | 32146 | 1.00x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 14162 | 16761 | 1.18x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 11132 | 15741 | 1.41x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 5886 | 11134 | 1.89x | 0/3 |
| const-value-extraction | payload-volume | production | 8680 | 28222 | 3.25x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 2818 | 15257 | 5.41x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 13912 | 77792 | 5.59x | 0/3 |
Codex Graph-Only Cost Ratios (post-fix)
| fixture | class | mode | no-graph median | graph-only median | go/ng | graph-only pass |
|---|---|---|---|---|---|---|
| deps-downstream-orbit-knowledge | graph-strength | production | 16379 | 2833 | 0.17x | 3/3 |
| macro-expanded-callers | precision-gap | synthetic | 11278 | 4368 | 0.39x | 3/3 |
| callers-2hop-graphbenchpolicy | graph-strength | synthetic | 12397 | 8532 | 0.69x | 3/3 |
| impl-divergence-trait-method | payload-volume | production | 16776 | 12646 | 0.75x | 3/3 |
| implementors-benchsink-with-blanket | graph-strength | synthetic | 11469 | 8705 | 0.76x | 3/3 |
| references-vs-callers-tool-registry-register | selector-ambiguity | production | 32141 | 27691 | 0.86x | 3/3 |
| function-as-value-vs-direct-call | precision-gap | production | 14162 | 13210 | 0.93x | 3/3 |
| generic-dispatch-concrete-impl | precision-gap | synthetic | 11132 | 20293 | 1.82x | 3/3 |
| construct-vs-match-benchevent-distinct | precision-gap | synthetic | 2818 | 8429 | 2.99x | 3/3 |
| module-surface-orbit-mcp | selector-ambiguity | production | 5886 | 21885 | 3.72x | 3/3 |
| const-value-extraction | payload-volume | production | 8680 | 67515 | 7.78x | 3/3 |
| reverse-export-orbit-error | graph-strength | production | 13912 | 122948 | 8.84x | 3/3 |
Post-fix Codex graph-only beats no-graph on 7/12 fixtures (vs 5/12 pre-fix). The losses are concentrated in two fixtures where pre-fix Codex was bailing early into wrong answers (reverse-export, module-surface); post-fix it does the real work and the cost surfaces. The third large outlier is const-value-extraction (7.78x) — a payload-volume fixture where graph enumeration is structurally complete but expensive on a value-extraction task.
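The 7/12 and 5/12 counts can be re-derived from the Codex cost-ratio tables by counting fixtures whose graph-only median is strictly below the no-graph median. A small check with the medians copied from those tables (names are illustrative):

```python
# fixture: (no-graph median, pre-fix graph-only median, post-fix graph-only median)
codex_medians = {
    "deps-downstream-orbit-knowledge": (16379, 2250, 2833),
    "callers-2hop-graphbenchpolicy": (12397, 5444, 8532),
    "impl-divergence-trait-method": (16776, 8081, 12646),
    "macro-expanded-callers": (11278, 6605, 4368),
    "implementors-benchsink-with-blanket": (11469, 7370, 8705),
    "references-vs-callers-tool-registry-register": (32141, 32146, 27691),
    "function-as-value-vs-direct-call": (14162, 16761, 13210),
    "generic-dispatch-concrete-impl": (11132, 15741, 20293),
    "module-surface-orbit-mcp": (5886, 11134, 21885),
    "const-value-extraction": (8680, 28222, 67515),
    "construct-vs-match-benchevent-distinct": (2818, 15257, 8429),
    "reverse-export-orbit-error": (13912, 77792, 122948),
}

pre_wins = [f for f, (ng, pre, _) in codex_medians.items() if pre < ng]
post_wins = [f for f, (ng, _, post) in codex_medians.items() if post < ng]
print(len(pre_wins), len(post_wins))  # 5 7
```

The two fixtures that moved into the winning set post-fix are references-vs-callers-tool-registry-register and function-as-value-vs-direct-call; the former was a 1.00x near-tie pre-fix (32,146 vs 32,141).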
Claude Tool Diagnostics
Per-tool response size is measured in response characters, not model tokens. The transcripts do not expose per-tool token attribution.
| tool | invocations | succeeded | failed | success_rate | median_response_chars | p90_response_chars |
|---|---|---|---|---|---|---|
| callers | 7 | 4 | 3 | 57% | 1838 | 29660 |
| refs | 14 | 10 | 4 | 71% | 3945 | 18990 |
| implementors | 6 | 6 | 0 | 100% | 1122 | 1249 |
| deps | 6 | 6 | 0 | 100% | 606 | 606 |
| pack | 2 | 1 | 1 | 50% | 979 | 979 |
| search | 80 | 80 | 0 | 100% | 212 | 1427 |
| show | 324 | 323 | 1 | 100% | 762 | 2931 |
| overview | 0 | 0 | 0 | n/a | - | - |
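The success_rate column is consistent with succeeded / invocations rounded to a whole percent, which is why show reads 100% despite one failed call. The sketch below assumes that rounding rule; it is inferred from the table values, not taken from the aggregator.

```python
def success_rate(succeeded, invocations):
    """Whole-percent success rate; None stands in for the table's n/a."""
    if invocations == 0:
        return None
    return round(succeeded / invocations * 100)

# (succeeded, invocations) copied from the Claude tool table.
claude_tools = {
    "callers": (4, 7),
    "refs": (10, 14),
    "pack": (1, 2),
    "show": (323, 324),
    "overview": (0, 0),
}

rates = {tool: success_rate(ok, n) for tool, (ok, n) in claude_tools.items()}
print(rates)
```

When reading these tables, the failed column is the reliable signal for rare failures; a 100% rate on a high-volume tool can still hide a handful of failed calls.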
Failed graph calls by message:
| message | count | affected tools |
|---|---|---|
| invalid input: include entries must be code, doc, config, or all, got ["code"] | 3 | refs |
| execution failed: selector BenchDerivedStruct::default:fn does not resolve to a node | 2 | callers |
| execution failed: selector BenchDerivedStruct::default does not resolve to a node | 1 | callers |
| invalid input: invalid selector ["file:crates/orbit-mcp/src/lib.rs"] | 1 | pack |
| invalid input: selector file:crates/orbit-common/src/types.rs does not resolve to a node | 1 | show |
| invalid input: include entries must be code, doc, config, or all, got ["code", "config"] | 1 | refs |
These failures are not the same shape as the pre-fix Codex scalar-list failures. The remaining Claude failures are nested-list/invalid-selector mistakes and a derive/default selector expectation that graph does not support.
Codex Tool Diagnostics (pre-fix)
Per-tool response size is measured in response characters, not model tokens. The transcripts do not expose per-tool token attribution.
| tool | invocations | succeeded | failed | success_rate | median_response_chars | p90_response_chars |
|---|---|---|---|---|---|---|
| callers | 16 | 16 | 0 | 100% | 1838 | 382947 |
| refs | 61 | 39 | 22 | 64% | 494 | 90616 |
| implementors | 13 | 12 | 1 | 92% | 1122 | 1249 |
| deps | 6 | 6 | 0 | 100% | 606 | 606 |
| pack | 69 | 65 | 4 | 94% | 1724 | 19595 |
| search | 103 | 102 | 1 | 99% | 260 | 2653 |
| show | 84 | 84 | 0 | 100% | 777 | 1587 |
| overview | 22 | 22 | 0 | 100% | 2436 | 59717 |
Failed graph calls by message:
| message | count | affected tools |
|---|---|---|
| invalid input: include must be an array of strings | 22 | refs |
| invalid input: selectors must be an array | 4 | pack |
| invalid input: query must not be empty | 1 | search |
| invalid input: invalid selector BenchAuditSink | 1 | implementors |
The refs.include and pack.selectors scalar-list failures were addressed by T20260425-0729 before the post-fix rerun.
Codex Tool Diagnostics (post-fix)
| tool | invocations | succeeded | failed | success_rate | median_response_chars | p90_response_chars |
|---|---|---|---|---|---|---|
| show | 131 | 131 | 0 | 100% | 2181 | 4018 |
| pack | 102 | 102 | 0 | 100% | 6261 | 53902 |
| search | 101 | 99 | 2 | 98% | 617 | 14112 |
| refs | 45 | 43 | 2 | 96% | 1941 | 116359 |
| overview | 24 | 24 | 0 | 100% | 3626 | 128204 |
| callers | 23 | 23 | 0 | 100% | 3967 | 319274 |
| implementors | 9 | 9 | 0 | 100% | 2181 | 2689 |
| deps | 3 | 3 | 0 | 100% | 1396 | 1396 |
Failed graph calls by message:
| message | count | affected tools |
|---|---|---|
| invalid input: query must not be empty | 2 | search |
| invalid input: invalid selector ToolRegistry::register: selectors must start with dir:, file:, or symbol: | 1 | refs |
| invalid input: invalid selector OrbitError: selectors must start with dir:, file:, or symbol: | 1 | refs |
The schema-coercion class (scalar-as-array) is fully resolved: all 26 scalar-list failures (22 on refs.include, 4 on pack.selectors) among the 28 pre-fix failed calls are gone. The remaining failures are query/selector ergonomics: empty query strings and selectors missing a dir:, file:, or symbol: prefix. Both classes are recoverable by the agent on retry, but the latter is the same shape Claude hits post-fix and is a candidate for the next ergonomics task.
Tool-mix shift is also notable: post-fix Codex now leans on show (131 invocations vs 84 pre-fix) and pack (102 vs 69) — direct file/symbol inspection — instead of refs (45 vs 61) where the schema-coercion friction lived.
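The selector failures listed in the diagnostics above (a bare OrbitError, a selector wrapped in a one-element list) suggest what a more forgiving front end could accept. The following is a minimal sketch of such a normalizer, not the shipped graph parser: `normalize_selector` and its leniency rules are hypothetical, and only the three prefixes come from the error messages quoted above.

```python
# Prefixes taken from the "selectors must start with" error text above.
KNOWN_PREFIXES = ("dir:", "file:", "symbol:")


def normalize_selector(raw):
    """Return a prefixed selector string, or raise ValueError.

    Hypothetical leniency rules (assumptions, not current graph behavior):
    - a one-element list such as ["file:a.rs"] is unwrapped;
    - a bare name such as OrbitError is treated as symbol:OrbitError.
    """
    if isinstance(raw, (list, tuple)):
        if len(raw) != 1:
            raise ValueError(f"expected one selector, got {len(raw)}")
        raw = raw[0]
    if not isinstance(raw, str) or not raw.strip():
        raise ValueError("selector must be a non-empty string")
    selector = raw.strip()
    if selector.startswith(KNOWN_PREFIXES):
        return selector
    # Assumption: an unprefixed selector most likely names a symbol.
    return f"symbol:{selector}"
```

Defaulting bare names to symbol: is a design guess; a stricter variant could return a "did you mean" hint in the error instead of rewriting the input silently.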
Failure Taxonomy
Non-passing runs:
| provider | arm | fixture | seeds | classification | observed answer |
|---|---|---|---|---|---|
| claude | no-graph | const-value-extraction | 1, 2 | source-search miss | omitted V2_TOOL_WILDCARD_ROOTS; seed 3 found the full set |
| codex | graph-only (pre-fix) | module-surface-orbit-mcp | 1, 2, 3 | known graph bug / root-surface gap (T20260425-0739) | returned McpHost, serve_stdio; excluded OrbitToolServer. Resolved post-fix: 3/3 pass. |
| codex | graph-only (pre-fix) | reverse-export-orbit-error | 1, 2, 3 | known graph bug / re-export metadata gap (T20260425-0739) | returned []; excluded the original definition. Resolved post-fix: 3/3 pass. |
Anomaly flags are not mutually exclusive. Primary means the row count emitted by aggregate.py’s precedence-ordered taxonomy; independent means the flag was true even if another flag won precedence.
| provider | flag | runs | notes |
|---|---|---|---|
| claude | schema-coercion | 8 primary | 9 failed graph calls total; all recovered. Remaining shapes are nested-list/invalid-selector errors, not the pre-fix scalar-list issue. |
| claude | payload-firehose | 7 primary / 13 independent | Concentrated in graph-only const-value-extraction, macro-expanded-callers, reverse-export-orbit-error, implementors-benchsink-with-blanket, and one generic-dispatch-concrete-impl seed. |
| claude | wrong-tool | 0 | No graph-only Claude run failed. |
| claude | design-defect | 21 | Hybrid passed with zero graph calls. Interpret as “organic selection avoided graph”, not as a correctness failure. |
| codex (pre-fix) | schema-coercion | 25 primary | 28 failed graph calls total; most were recovered by retrying with array-shaped args. |
| codex (pre-fix) | payload-firehose | 2 primary / 6 independent | Primary taxonomy hides several firehose runs behind schema-coercion. |
| codex (pre-fix) | wrong-tool | 6 | The six pre-fix graph-only non-passing runs above. |
| codex (pre-fix) | design-defect | 13 | Hybrid passed with zero graph calls. Interpret as “organic selection avoided graph”, not as a correctness failure. |
| codex (post-fix) | schema-coercion | 4 primary | All 4 are query/selector ergonomics (empty query, missing dir:/file:/symbol: prefix), not the pre-fix scalar-list class. |
| codex (post-fix) | payload-firehose | 4 primary | Concentrated in const-value-extraction and reverse-export-orbit-error — fixtures where graph enumerates many candidates and Codex pages through them. |
| codex (post-fix) | wrong-tool | 0 | All 36 graph-only cells passed. |
| codex (post-fix) | design-defect | 13 | Same hybrid runs as pre-fix; hybrid was not rerun. |
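The difference between primary and independent counts in the table above can be illustrated with a small tally. This is a sketch under stated assumptions, not the logic in aggregate.py: the `PRECEDENCE` order is hypothetical apart from schema-coercion outranking payload-firehose, which the pre-fix Codex note implies, and the sample runs are invented for illustration.

```python
from collections import Counter

# Hypothetical precedence order; the real order lives in aggregate.py.
PRECEDENCE = ["wrong-tool", "schema-coercion", "payload-firehose", "design-defect"]


def tally(runs):
    """Count flags two ways over an iterable of per-run flag sets.

    primary: each run counts once, under its highest-precedence flag.
    independent: each run counts under every flag that was true.
    """
    primary, independent = Counter(), Counter()
    for flags in runs:
        independent.update(flags)
        winner = next((flag for flag in PRECEDENCE if flag in flags), None)
        if winner is not None:
            primary[winner] += 1
    return primary, independent


# Invented sample: one run trips both flags, so firehose loses precedence there.
sample_runs = [
    {"schema-coercion", "payload-firehose"},
    {"payload-firehose"},
    {"schema-coercion"},
    set(),
]
primary, independent = tally(sample_runs)
print(dict(primary))
print(dict(independent))
```

In the sample, payload-firehose is true for two runs but primary for only one, which is the same effect that hides firehose runs behind schema-coercion in the pre-fix Codex row.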
Standout Fixtures
Top graph-only wins by token reduction:
| provider | fixture | result |
|---|---|---|
| codex (pre-fix) | deps-downstream-orbit-knowledge | 3/3 pass, 0.14x no-graph tokens |
| codex (post-fix) | deps-downstream-orbit-knowledge | 3/3 pass, 0.17x no-graph tokens |
| codex (post-fix) | macro-expanded-callers | 3/3 pass, 0.39x no-graph tokens |
| codex (pre-fix) | callers-2hop-graphbenchpolicy | 3/3 pass, 0.44x no-graph tokens |
| codex (pre-fix) | impl-divergence-trait-method | 3/3 pass, 0.48x no-graph tokens |
| claude | deps-downstream-orbit-knowledge | 3/3 pass, 0.83x no-graph tokens |
| claude | references-vs-callers-tool-registry-register | 3/3 pass, 0.89x no-graph tokens |
Accuracy standouts:
| provider | fixture | result |
|---|---|---|
| claude + codex (post-fix) | reverse-export-orbit-error | both providers graph-only 3/3 post-fix; Codex pre-fix was 0/3 |
| claude + codex (post-fix) | module-surface-orbit-mcp | both providers graph-only 3/3 post-fix; Codex pre-fix was 0/3 |
| claude | const-value-extraction | graph-only 3/3 while no-graph was 1/3 |
Top graph-only losses:
| provider | fixture | result |
|---|---|---|
| codex (post-fix) | reverse-export-orbit-error | 3/3 pass, 8.84x no-graph tokens (pre-fix was 0/3 at 5.59x) |
| codex (post-fix) | const-value-extraction | 3/3 pass, 7.78x no-graph tokens |
| claude | const-value-extraction | 3/3 pass, 9.71x no-graph tokens |
| claude | macro-expanded-callers | 3/3 pass, 7.61x no-graph tokens |
| claude | reverse-export-orbit-error | 3/3 pass post-fix, 6.73x no-graph tokens |
| codex (post-fix) | module-surface-orbit-mcp | 3/3 pass, 3.72x no-graph tokens (pre-fix was 0/3 at 1.89x) |
Interpretation
The full v4 result, now with both providers post-fix on graph-only, supports keeping graph as an optional navigation surface, not as a replacement for source reads. Graph is excellent when the question maps directly to a precise graph primitive, with deps-downstream-orbit-knowledge the cleanest repeated win across both providers (0.17x for Codex post-fix, 0.83x for Claude).
T20260425-0739 was the right diagnosis. The post-fix Codex rerun confirms it directly: both reverse-export-orbit-error and module-surface-orbit-mcp flipped from 0/3 to 3/3 with no other changes. With Codex now post-fix, the Codex-vs-Claude comparison is clean — and the residual cost difference (Codex graph-only median 12,928 vs Claude graph-only median 2,330) is the load-bearing provider-behavior signal, not a tool-bug artifact.
The post-fix data also exposes a subtler pattern: fixing the bug increased cost on the formerly-failing fixtures. Pre-fix Codex bailed early on reverse-export-orbit-error (~78k tokens to fail); post-fix it does the real work and pays ~123k tokens to succeed. The pre-fix cost ratios on those two fixtures were under-estimates of “what graph-only actually costs to answer this question.” Post-fix cost ratios are the honest reading.
Hybrid is still the practical success case, but its meaning differs by provider. Codex selectively used graph and got the strongest overall cost/correctness profile on the hybrid subset. Claude mostly avoided graph in hybrid, so its 24/24 result is better read as “source fallback remains essential” than “graph was selected well.” The post-fix rerun does not change this — Codex hybrid was not rerun, but its 24/24 + 11/24 graph-call rate is unchallenged.
The highest-leverage next steps are:
- Add payload shaping for high-cardinality responses (refs, overview, callers, and repeated show) so graph-only cannot spend 6x-10x tokens on enumeration. The post-fix reverse-export (8.84x) and const-value-extraction (7.78x) cells are the load-bearing examples.
- Tighten selector ergonomics: post-fix Codex still hit 4 failed graph calls, 2 empty queries and 2 unprefixed selectors (OrbitError, ToolRegistry::register instead of symbol:OrbitError). Claude hits the same shape post-fix. Both providers want a forgiving selector parser.
- Add a small hybrid-selection round with explicit “prefer graph when it directly answers the relationship; fall back to source for bodies/values” guidance. Neutral hybrid prompts measure organic tool choice, and Claude’s organic choice was mostly “do not use graph.”
- Optionally rerun Codex hybrid post-fix to close out tool-fix confound entirely. Expected delta: 3 failed graph calls → 0; pass rate stays at 24/24; median tokens drop slightly. Skipped here on cost grounds.
- Keep fixture-level ratios as the main cost metric. Aggregate medians hide both the deps win and the expensive-but-correct post-fix reverse-export result.
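The payload-shaping step in the list above could take many forms. One minimal sketch, assuming responses are lists of result lines and the budget is expressed in characters (the same unit the tool diagnostics report), is shown here; `shape_payload` and the default `max_chars` value are hypothetical and not part of the graph tools.

```python
def shape_payload(items, max_chars=4000):
    """Keep leading items within a character budget and report what was cut.

    Returns (body, omitted): the joined text plus the number of dropped items.
    """
    kept, used = [], 0
    for item in items:
        cost = len(item) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        kept.append(item)
        used += cost
    omitted = len(items) - len(kept)
    body = "\n".join(kept)
    if omitted:
        # Tell the agent results were cut so it can narrow the query.
        body += f"\n... {omitted} more results omitted; narrow the query to see them"
    return body, omitted
```

For example, `shape_payload(["a" * 10] * 5, max_chars=25)` keeps two items and reports three omitted. Stating the omitted count explicitly matters because an agent that cannot tell a list was cut may treat a partial enumeration as complete.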
Reproduction
Aggregate tables:

    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/aggregate.py \
      --runs benchmarks/graph/v4/runs \
      --tasks benchmarks/graph/v4/tasks

Completed Codex sweeps:

The original no-graph and graph-only Codex artifacts are pre-fix for T20260425-0729 and T20260425-0739. Codex graph-only was rerun post-fix on 2026-04-25; the pre-fix graph-only artifacts now live at _archive/codex-graph-only-pre-fix-T20260425-0739/.

    # Pre-fix Codex no-graph + graph-only (graph-only artifacts later archived)
    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
      --provider codex --arms no-graph graph-only --n 3

    # Pre-fix Codex hybrid (not rerun post-fix; passed 24/24)
    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
      --provider codex --arms hybrid --n 3 \
      --tasks callers-2hop-graphbenchpolicy construct-vs-match-benchevent-distinct \
        deps-downstream-orbit-knowledge function-as-value-vs-direct-call \
        generic-dispatch-concrete-impl implementors-benchsink-with-blanket \
        macro-expanded-callers reverse-export-orbit-error

    # Post-fix Codex graph-only rerun (after archiving pre-fix dir)
    mv benchmarks/graph/v4/runs/codex/graph-only \
      benchmarks/graph/v4/_archive/codex-graph-only-pre-fix-T20260425-0739
    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
      --provider codex --arms graph-only --n 3

To re-aggregate the pre-fix data, temporarily symlink the archived dir back into runs/codex/graph-only (or pass a different --runs root pointing at _archive/).

Completed Claude sweeps:

    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
      --provider claude --arms no-graph graph-only --n 3

    GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py \
      --provider claude --arms hybrid --n 3 \
      --tasks callers-2hop-graphbenchpolicy construct-vs-match-benchevent-distinct \
        deps-downstream-orbit-knowledge function-as-value-vs-direct-call \
        generic-dispatch-concrete-impl implementors-benchsink-with-blanket \
        macro-expanded-callers reverse-export-orbit-error