v1 — Method & Caveats

Mirrored from benchmarks/graph/v1/METHOD.md. Edit the source document in the repository, not this generated page.

This file documents how the graph_v1 sweep was conducted. The report itself is RESULTS.md in this directory. Conventions governing this document’s shape are in benchmarks/CONVENTIONS.md.

Harness git SHA at freeze

9d42ccea3323b3c0208e5590125703ddc9631ae2

The harness code, fixtures, and run records under benchmarks/graph/v1/ reflect the state of the repository at this SHA.

Delta vs v0

v1 is the first frozen round, so there is no prior version to diff against.

Scope

Providers: Claude (sonnet-4-6 + haiku-4-5, via Claude Code CLI), Codex (gpt-5.4, via codex exec).
Arms: no-graph (Read/Grep/Glob + shell), graph-only (MCP graph tools only), hybrid (both).
Fixtures: 6 (listed below).
Seeds: 5 per (provider × arm × fixture) cell.
Total runs: 2 providers × 3 arms × 6 fixtures × 5 seeds = 180 runs (18 cells and 90 runs per provider).
Sweep seed: 1609 (shared across providers for paired ordering).
Sweep date: 2026-04-22.
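The sweep grid can be enumerated directly to sanity-check run counts against the figures quoted in the caveats (for example, 30 hybrid runs per provider). The sketch below is illustrative only: the provider keys and the `paired_order` helper are hypothetical, and shuffling with the sweep seed is just one assumed way to obtain the same ordering for both providers, not a description of the harness.

```python
from itertools import product
import random

PROVIDERS = ["claude", "codex"]          # hypothetical keys for the two providers
ARMS = ["no-graph", "graph-only", "hybrid"]
FIXTURES = [
    "locate-agentruntime",
    "locate-v2-runtime-host-trait",
    "trace-policy-denial-wiring",
    "trace-v2runtime-production-impls",
    "impact-scope-strategy-callsites",
    "deps-orbit-knowledge-consumers",
]
SEEDS_PER_CELL = 5
SWEEP_SEED = 1609


def enumerate_runs():
    """Return one (provider, arm, fixture, seed) tuple per run in the grid."""
    return [
        (provider, arm, fixture, seed)
        for provider, arm, fixture in product(PROVIDERS, ARMS, FIXTURES)
        for seed in range(SEEDS_PER_CELL)
    ]


def paired_order(provider):
    """Assumed scheme: shuffle the per-provider runs with the shared sweep
    seed so that every provider sees the same (arm, fixture, seed) order."""
    runs = [
        (arm, fixture, seed)
        for arm, fixture in product(ARMS, FIXTURES)
        for seed in range(SEEDS_PER_CELL)
    ]
    random.Random(SWEEP_SEED).shuffle(runs)
    return [(provider, *run) for run in runs]


runs = enumerate_runs()
total_runs = len(runs)                                        # 180
runs_per_provider = sum(1 for r in runs if r[0] == "claude")  # 90
hybrid_per_provider = sum(
    1 for r in runs if r[0] == "claude" and r[1] == "hybrid"
)                                                             # 30
print(total_runs, runs_per_provider, hybrid_per_provider)
```

Because the shuffle depends only on the seed and the run list, calling `paired_order` for each provider yields identical orderings apart from the provider field.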

Fixture list

fixture	class	difficulty	one-line purpose
`locate-agentruntime`	locate	easy	Find the `AgentRuntime` trait definition and list all 5 impls across `orbit-agent`’s CLI runtimes. Drift-trap on historical names (`HttpAgentRuntime`, `AnthropicRuntime`, `OpenAIRuntime`).
`locate-v2-runtime-host-trait`	locate	easy	Find the `V2RuntimeHost` trait definition. Filename-collision trap: `crates/orbit-core/src/runtime/v2_host.rs` is the impl, not the definition.
`trace-policy-denial-wiring`	trace	medium	List every file that constructs (not destructures) `LoopAuditEvent::PolicyDenial{...}` plus every `AuditSink` impl. Deny-list excludes destructure-only example sites.
`trace-v2runtime-production-impls`	trace	medium	List production `V2RuntimeHost` impls under `crates/<name>/src/`, excluding examples/tests/benches (5 example impls to filter out).
`impact-scope-strategy-callsites`	impact	medium	List files containing `ScopeStrategy::<Variant>` tokens (4 files). Deny-list covers doc-comment-only and bare-variant drift sites.
`deps-orbit-knowledge-consumers`	deps	easy	List crates declaring a direct dependency on `orbit-knowledge` (2 Cargo.toml files). Deny-list covers hallucinated consumer crates.
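Several fixtures pair an expected list with a deny-list. The scoring rule is not described in this document; the sketch below shows one plausible reading, in which a run passes only if every expected item appears and no deny-listed item does. The function `score_answer` and the name `PlaceholderRuntimeA` are hypothetical, and treating the historical runtime names as a deny-list is an assumption made for illustration.

```python
def score_answer(answer_items, expected, deny):
    """Assumed rule: pass iff all expected items are present and no
    deny-listed item appears. Extra, non-denied items are tolerated."""
    answer = set(answer_items)
    missing = set(expected) - answer
    denied = answer & set(deny)
    return {
        "passed": not missing and not denied,
        "missing": sorted(missing),
        "denied": sorted(denied),
    }


# OllamaRuntime and the historical names come from the fixture description;
# PlaceholderRuntimeA is a stand-in, not a real impl name.
expected = {"OllamaRuntime", "PlaceholderRuntimeA"}
deny = {"HttpAgentRuntime", "AnthropicRuntime", "OpenAIRuntime"}

good = score_answer(["OllamaRuntime", "PlaceholderRuntimeA"], expected, deny)
drifted = score_answer(
    ["OllamaRuntime", "PlaceholderRuntimeA", "AnthropicRuntime"], expected, deny
)
incomplete = score_answer(["PlaceholderRuntimeA"], expected, deny)

print(good["passed"], drifted["denied"], incomplete["missing"])
```

Under this rule a drifted answer fails even when it contains every expected item, which is the behaviour a drift-trap is meant to enforce.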

Known caveats

These are material to interpreting RESULTS.md and should be read before relying on any headline number.

Hybrid tool utilization was 1/60 runs (≈1.7%). Claude made exactly one orbit_graph_implementors call across 30 hybrid runs; Codex made zero graph calls across 30 hybrid runs. Every other hybrid run solved the task with Grep / Read / shell rg. Consequence: the cost parity between hybrid and no-graph in the primary table is a null result (schema-in-prompt is cheap because the tools are almost never invoked), not evidence that graph tools integrate cheaply. See RESULTS.md §Tool-utilization audit.
Codex cost is reported as $0. The Codex CLI does not emit billing data, so the provider normalizer records 0. All USD figures in RESULTS.md are Claude-only.
Fixture set is grep-skewed. 5 of 6 fixtures can be solved via rg in 1–2 calls, so the sweep likely understates the graph tools’ value. v2 should add navigation-heavy fixtures (cross-crate trait walks with name collisions, reverse caller queries, refactor-impact across the type graph) where grep produces ambiguous or noisy results and the structural index is load-bearing.
locate-agentruntime had pre-sweep drift (missing OllamaRuntime in the expected impl list; stale commit_sha: 87b709c2 predating Ollama). The drift was caught and corrected before the v1 run. The fixture shipped under graph_v1/tasks/locate-agentruntime.yaml is the corrected version anchored at commit SHA 9d42ccea.
Pass-rate ceiling on Claude. 119/120 passes regardless of arm. This sweep cannot discriminate Claude arms on accuracy, only on cost. v2 should include harder fixtures if accuracy separation is the goal.
Tool-utilization audit was ad-hoc. Counts were produced by a transcript-level scan at report time, not by aggregate.py. RESULTS.md §Recommendations #5 files a follow-up to fold utilization into the aggregator.

Reproduction

v1 uses the shared harness at benchmarks/graph/scripts/. At freeze time v1 had its own scripts/ directory; it was removed post-freeze once the shared aggregator was confirmed backward-compatible with v1’s record schema. To regenerate the tables:

make -C benchmarks graph-aggregate GRAPH_VERSION=v1

Or directly:

GRAPH_VERSION=v1 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v1/runs \
  --tasks benchmarks/graph/v1/tasks

This regenerates the primary and secondary tables. The tool-utilization audit in RESULTS.md is not regenerated by this command; it was produced by an ad-hoc transcript scan.