v1 — Method & Caveats
Mirrored from benchmarks/graph/v1/METHOD.md. Edit the source document in the repository, not this generated page.
This file documents how the graph_v1 sweep was conducted. The report itself is RESULTS.md in this directory. Conventions governing this document’s shape are in benchmarks/CONVENTIONS.md.
Harness git SHA at freeze
9d42ccea3323b3c0208e5590125703ddc9631ae2
The harness code, fixtures, and run records under benchmarks/graph/v1/ reflect the state of the repository at this SHA.
Delta vs v0
v1 is the first frozen round; no prior version to diff against.
- Providers: Claude (sonnet-4-6 + haiku-4-5, via Claude Code CLI), Codex (gpt-5.4, via `codex exec`).
- Arms: `no-graph` (Read/Grep/Glob + shell), `graph-only` (MCP graph tools only), `hybrid` (both).
- Fixtures: 6 (listed below).
- Seeds: 5 per (provider × arm × fixture) cell.
- Total runs: 2 providers × 3 arms × 6 fixtures × 5 seeds = 180 runs (36 cells; 90 runs per provider).
- Sweep seed: 1609 (shared across providers for paired ordering).
- Sweep date: 2026-04-22.
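As a rough illustration of the sweep matrix, the sketch below multiplies out the factors listed above and builds one shuffled run order that both providers share. The provider keys, the `build_schedule` helper, and the shuffle-once pairing scheme are assumptions made for illustration; only the counts, the fixture names, and the seed 1609 come from this document, and the real harness may order runs differently.

```python
import itertools
import random

# Illustrative names; only the counts and the sweep seed come from this document.
PROVIDERS = ["claude", "codex"]
ARMS = ["no-graph", "graph-only", "hybrid"]
FIXTURES = [
    "locate-agentruntime",
    "locate-v2-runtime-host-trait",
    "trace-policy-denial-wiring",
    "trace-v2runtime-production-impls",
    "impact-scope-strategy-callsites",
    "deps-orbit-knowledge-consumers",
]
SEEDS_PER_CELL = 5
SWEEP_SEED = 1609


def build_schedule():
    """Return {provider: ordered list of (arm, fixture, seed_index)}."""
    base = list(itertools.product(ARMS, FIXTURES, range(SEEDS_PER_CELL)))
    # Assumption: shuffle once with the sweep seed, then reuse the same order
    # for every provider so run i of one provider pairs with run i of the other.
    random.Random(SWEEP_SEED).shuffle(base)
    return {provider: list(base) for provider in PROVIDERS}


schedule = build_schedule()
cells = len(PROVIDERS) * len(ARMS) * len(FIXTURES)
runs_per_provider = len(schedule["claude"])
total_runs = sum(len(order) for order in schedule.values())
print(cells, runs_per_provider, total_runs)  # prints: 36 90 180
```

Seeding a dedicated `random.Random` instance rather than the module-level generator keeps the ordering reproducible even if other code draws random numbers.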
Fixture list
| fixture | class | difficulty | one-line purpose |
|---|---|---|---|
| `locate-agentruntime` | locate | easy | Find the `AgentRuntime` trait definition and list all 5 impls across orbit-agent’s CLI runtimes. Drift-trap on historical names (`HttpAgentRuntime`, `AnthropicRuntime`, `OpenAIRuntime`). |
| `locate-v2-runtime-host-trait` | locate | easy | Find the `V2RuntimeHost` trait definition. Filename-collision trap: `crates/orbit-core/src/runtime/v2_host.rs` is the impl, not the definition. |
| `trace-policy-denial-wiring` | trace | medium | List every file that constructs (not destructures) `LoopAuditEvent::PolicyDenial{...}` plus every `AuditSink` impl. Deny-list excludes destructure-only example sites. |
| `trace-v2runtime-production-impls` | trace | medium | List production `V2RuntimeHost` impls under `crates/<name>/src/`, excluding examples/tests/benches (5 example impls to filter out). |
| `impact-scope-strategy-callsites` | impact | medium | List files containing `ScopeStrategy::<Variant>` tokens (4 files). Deny-list covers doc-comment-only and bare-variant drift sites. |
| `deps-orbit-knowledge-consumers` | deps | easy | List crates declaring a direct dependency on `orbit-knowledge` (2 `Cargo.toml` files). Deny-list covers hallucinated consumer crates. |
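Several fixtures pair an expected list with a deny-list. A minimal sketch of how such a check could work is shown below. The `Fixture` fields, the `score` function, and the pass rule (every expected item present, no deny-listed item present, other extras tolerated) are assumptions for illustration, not the harness's actual schema or grading logic, and the paths in the example are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Fixture:
    # Hypothetical schema: the real task YAML may use different field names.
    expected: set[str]
    deny: set[str] = field(default_factory=set)


def score(fixture: Fixture, answer: set[str]) -> dict:
    """Pass only if every expected item is present and no deny-listed item is."""
    missing = fixture.expected - answer
    denied = fixture.deny & answer
    return {
        "passed": not missing and not denied,
        "missing": sorted(missing),
        "denied": sorted(denied),
    }


# Placeholder paths, not the real consumers of orbit-knowledge.
fx = Fixture(
    expected={"crates/a/Cargo.toml", "crates/b/Cargo.toml"},
    deny={"crates/imaginary/Cargo.toml"},
)
ok = score(fx, {"crates/a/Cargo.toml", "crates/b/Cargo.toml"})
bad = score(fx, {"crates/a/Cargo.toml", "crates/imaginary/Cargo.toml"})
print(ok["passed"], bad["passed"], bad["missing"], bad["denied"])
```

Under this rule a deny-list hit fails the run even when every expected item is present, which is what makes a hallucinated or drifted answer detectable.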
Known caveats
These are material to interpreting RESULTS.md and should be read before relying on any headline number.
- Hybrid tool-utilization was 1/60 runs (≈1.7 %). Claude made exactly one `orbit_graph_implementors` call across 30 hybrid runs; Codex made zero graph calls across 30 hybrid runs. Every other hybrid run solved the task with `Grep`/`Read`/shell `rg`. Consequence: the cost-parity between `hybrid` and `no-graph` in the primary table is a null result (schema-in-prompt is cheap because the tools are never invoked), not evidence that graph tools integrate cheaply. See `RESULTS.md` §Tool-utilization audit.
- Codex cost is reported as $0. The Codex CLI does not emit billing; the provider normalizer faithfully records 0. All USD figures in `RESULTS.md` are Claude only.
- Fixture set is grep-skewed. 5 of 6 fixtures can be solved via `rg` in 1–2 calls. This likely under-counts the graph tools’ value. v2 should add navigation-heavy fixtures (cross-crate trait walks with name collisions, reverse caller queries, refactor-impact across the type graph) where grep produces ambiguous or noisy results and the structural index is load-bearing.
- `locate-agentruntime` had pre-sweep drift (missing `OllamaRuntime` in the expected impl list; stale `commit_sha: 87b709c2` predating Ollama). The drift was caught and corrected before the v1 run. The fixture shipped under `graph_v1/tasks/locate-agentruntime.yaml` is the corrected version anchored at commit SHA `9d42ccea`.
- Pass-rate ceiling on Claude. 119/120 passes regardless of arm. This sweep cannot discriminate Claude arms on accuracy, only on cost. v2 should include harder fixtures if accuracy separation is the goal.
- Tool-utilization audit was ad-hoc. Counts were produced by a transcript-level scan at report time, not by `aggregate.py`. `RESULTS.md` §Recommendations #5 files a follow-up to fold utilization into the aggregator.
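For a sense of what folding the utilization audit into an aggregator might involve, the sketch below counts hybrid runs that made at least one graph tool call. The record fields (`provider`, `arm`, `tool_calls`), the `orbit_graph_` prefix test (inferred from the one tool name mentioned above), and the synthetic records are all assumptions. It is not the ad-hoc scan that produced the reported counts.

```python
from collections import Counter

# Assumption: graph tools share this prefix, inferred from orbit_graph_implementors.
GRAPH_TOOL_PREFIX = "orbit_graph_"


def graph_utilization(runs):
    """Per provider: (hybrid runs with >=1 graph tool call, total hybrid runs)."""
    used, total = Counter(), Counter()
    for run in runs:
        if run["arm"] != "hybrid":
            continue
        total[run["provider"]] += 1
        if any(name.startswith(GRAPH_TOOL_PREFIX) for name in run["tool_calls"]):
            used[run["provider"]] += 1
    return {provider: (used[provider], total[provider]) for provider in total}


# Synthetic records shaped to mirror the counts reported in the caveat above.
runs = [{"provider": "claude", "arm": "hybrid", "tool_calls": ["Grep", "Read"]}] * 29
runs.append(
    {"provider": "claude", "arm": "hybrid", "tool_calls": ["orbit_graph_implementors"]}
)
runs += [{"provider": "codex", "arm": "hybrid", "tool_calls": ["rg"]}] * 30
runs.append({"provider": "claude", "arm": "no-graph", "tool_calls": ["Grep"]})

result = graph_utilization(runs)
used_total = sum(u for u, _ in result.values())
hybrid_total = sum(t for _, t in result.values())
rate = f"{100 * used_total / hybrid_total:.1f} %"
print(result, rate)
```

Counting runs rather than individual calls matches the "1/60 runs" framing. A per-call count would need a separate tally.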
Reproduction
v1 uses the shared harness at benchmarks/graph/scripts/. At freeze time v1 had its own scripts/ directory; that was removed post-freeze once the shared aggregator was confirmed backward-compatible with v1’s record schema. To regenerate the tables:

    make -C benchmarks graph-aggregate GRAPH_VERSION=v1

Or directly:

    GRAPH_VERSION=v1 python3 benchmarks/graph/scripts/aggregate.py \
      --runs benchmarks/graph/v1/runs \
      --tasks benchmarks/graph/v1/tasks

This regenerates the primary + secondary tables. The tool-utilization audit in RESULTS.md is not regenerated by this command; it was produced by an ad-hoc transcript scan.