
v1 — Method & Caveats

Mirrored from benchmarks/graph/v1/METHOD.md. Edit the source document in the repository, not this generated page.

This file documents how the graph_v1 sweep was conducted. The report itself is RESULTS.md in this directory. Conventions governing this document’s shape are in benchmarks/CONVENTIONS.md.

9d42ccea3323b3c0208e5590125703ddc9631ae2

The harness code, fixtures, and run records under benchmarks/graph/v1/ reflect the state of the repository at this SHA.

v1 is the first frozen round, so there is no prior version to diff against.

  • Providers: Claude (sonnet-4-6 + haiku-4-5, via Claude Code CLI), Codex (gpt-5.4, via codex exec).
  • Arms: no-graph (Read/Grep/Glob + shell), graph-only (MCP graph tools only), hybrid (both).
  • Fixtures: 6 (listed below).
  • Seeds: 5 per (provider × arm × fixture) cell.
  • Total runs: 2 providers × 3 arms × 6 fixtures × 5 seeds = 180 runs (36 cells), i.e. 90 runs per provider and 30 per provider per arm.
  • Sweep seed: 1609 (shared across providers for paired ordering).
  • Sweep date: 2026-04-22.
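The run matrix above can be sketched as a plain Cartesian product. This is a minimal illustration, not the harness code: the provider labels, the seed values 0 to 4, and the use of the sweep seed to shuffle one shared (arm, fixture, seed) order are all assumptions made here to show what "paired ordering" could mean.

```python
import itertools
import random

# Labels below are illustrative; only the arm and fixture names come from this document.
PROVIDERS = ["claude", "codex"]
ARMS = ["no-graph", "graph-only", "hybrid"]
FIXTURES = [
    "locate-agentruntime",
    "locate-v2-runtime-host-trait",
    "trace-policy-denial-wiring",
    "trace-v2runtime-production-impls",
    "impact-scope-strategy-callsites",
    "deps-orbit-knowledge-consumers",
]
SEEDS = range(5)   # assumption: seeds are numbered 0..4
SWEEP_SEED = 1609  # sweep seed stated in the list above


def paired_schedule(sweep_seed=SWEEP_SEED):
    """Return {provider: [(arm, fixture, seed), ...]} using one shared order."""
    cells = list(itertools.product(ARMS, FIXTURES, SEEDS))
    # Same seed -> same shuffle, so every provider sees runs in the same order.
    random.Random(sweep_seed).shuffle(cells)
    return {provider: list(cells) for provider in PROVIDERS}


schedule = paired_schedule()
runs_per_provider = len(schedule["claude"])
total_runs = sum(len(order) for order in schedule.values())
print(runs_per_provider, total_runs)
```

Counting the product this way gives 3 × 6 × 5 runs per provider, which is a quick sanity check on the totals quoted in RESULTS.md.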
| fixture | class | difficulty | one-line purpose |
| --- | --- | --- | --- |
| locate-agentruntime | locate | easy | Find the AgentRuntime trait definition and list all 5 impls across orbit-agent’s CLI runtimes. Drift-trap on historical names (HttpAgentRuntime, AnthropicRuntime, OpenAIRuntime). |
| locate-v2-runtime-host-trait | locate | easy | Find the V2RuntimeHost trait definition. Filename-collision trap: crates/orbit-core/src/runtime/v2_host.rs is the impl, not the definition. |
| trace-policy-denial-wiring | trace | medium | List every file that constructs (not destructures) LoopAuditEvent::PolicyDenial{...} plus every AuditSink impl. Deny-list excludes destructure-only example sites. |
| trace-v2runtime-production-impls | trace | medium | List production V2RuntimeHost impls under crates/<name>/src/, excluding examples/tests/benches (5 example impls to filter out). |
| impact-scope-strategy-callsites | impact | medium | List files containing ScopeStrategy::<Variant> tokens (4 files). Deny-list covers doc-comment-only and bare-variant drift sites. |
| deps-orbit-knowledge-consumers | deps | easy | List crates declaring a direct dependency on orbit-knowledge (2 Cargo.toml files). Deny-list covers hallucinated consumer crates. |
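Several fixtures pair an expected list with a deny-list. The sketch below shows one way such a check could work; it is an assumption, not the harness's scoring rule. The function name, the pass criterion (every expected item present, no denied item present), and the placeholder path `path/to/trait_definition.rs` are hypothetical. Only the v2_host.rs path is taken from the table.

```python
def score_answer(reported, expected, deny):
    """Score one run's reported paths against an expected list and a deny-list.

    Assumed rule: pass only if every expected path is reported and no
    deny-listed path is reported. Extra paths outside both lists are ignored.
    """
    reported = set(reported)
    missing = set(expected) - reported
    denied = reported & set(deny)
    return {
        "passed": not missing and not denied,
        "missing": sorted(missing),
        "denied": sorted(denied),
    }


# Hypothetical fixture data for the filename-collision trap.
expected = ["path/to/trait_definition.rs"]            # placeholder, not the real path
deny = ["crates/orbit-core/src/runtime/v2_host.rs"]   # the impl, per the table

good = score_answer(["path/to/trait_definition.rs"], expected, deny)
trapped = score_answer(
    ["path/to/trait_definition.rs", "crates/orbit-core/src/runtime/v2_host.rs"],
    expected,
    deny,
)
print(good["passed"], trapped["passed"])
```

Under this rule a run that names the right file but also falls for the trap still fails, which is the behaviour a deny-list is meant to enforce.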

These are material to interpreting RESULTS.md and should be read before relying on any headline number.

  1. Hybrid tool-utilization was 1/60 runs (≈1.7 %). Claude made exactly one orbit_graph_implementors call across 30 hybrid runs; Codex made zero graph calls across 30 hybrid runs. Every other hybrid run solved the task with Grep / Read / shell rg. Consequence: the cost-parity between hybrid and no-graph in the primary table is a null result (schema-in-prompt is cheap because the tools are never invoked), not evidence that graph tools integrate cheaply. See RESULTS.md §Tool-utilization audit.
  2. Codex cost is reported as $0. The Codex CLI does not emit billing data, so the provider normalizer faithfully records 0. All USD figures in RESULTS.md are Claude-only.
  3. Fixture set is grep-skewed. 5 of 6 fixtures can be solved via rg in 1–2 calls. This likely under-counts the graph tools’ value. v2 should add navigation-heavy fixtures (cross-crate trait walks with name collisions, reverse caller queries, refactor-impact across the type graph) where grep produces ambiguous or noisy results and the structural index is load-bearing.
  4. locate-agentruntime had pre-sweep drift (missing OllamaRuntime in the expected impl list; stale commit_sha: 87b709c2 predating Ollama). The drift was caught and corrected before the v1 run. The fixture shipped under graph_v1/tasks/locate-agentruntime.yaml is the corrected version anchored at commit SHA 9d42ccea.
  5. Pass-rate ceiling on Claude. 119/120 passes regardless of arm. This sweep cannot discriminate Claude arms on accuracy — only on cost. v2 should include harder fixtures if accuracy separation is the goal.
  6. Tool-utilization audit was ad-hoc. Counts were produced by a transcript-level scan at report time, not by aggregate.py. RESULTS.md §Recommendations #5 files a follow-up to fold utilization into the aggregator.
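The transcript-level scan behind caveats 1 and 6 could look roughly like the sketch below. The actual transcript format is not described in this document, so the JSONL layout, the `tool` field, and the `orbit_graph_` prefix (inferred from the single tool name `orbit_graph_implementors`) are all assumptions; the synthetic transcripts only reproduce the 1-in-60 count reported above.

```python
import json
from collections import Counter

GRAPH_TOOL_PREFIX = "orbit_graph_"  # assumption: all graph tools share this prefix


def count_graph_calls(transcript_lines):
    """Count graph-tool calls in one run's transcript (assumed JSONL, one event per line)."""
    calls = Counter()
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        name = event.get("tool", "")  # hypothetical field name
        if name.startswith(GRAPH_TOOL_PREFIX):
            calls[name] += 1
    return calls


def utilization(transcripts):
    """Return (runs with at least one graph call, total runs)."""
    used = sum(1 for lines in transcripts if count_graph_calls(lines))
    return used, len(transcripts)


# Synthetic stand-in for the 60 hybrid runs: 59 grep-only, 1 with a graph call.
grep_only = [json.dumps({"tool": "Grep"}), json.dumps({"tool": "Read"})]
with_graph = grep_only + [json.dumps({"tool": "orbit_graph_implementors"})]
transcripts = [grep_only] * 59 + [with_graph]

used, total = utilization(transcripts)
print(f"{used}/{total} = {100 * used / total:.1f} %")
```

Folding a function like this into the aggregator would make the utilization figure reproducible from the run records instead of depending on a one-off scan.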

v1 uses the shared harness at benchmarks/graph/scripts/. At freeze time v1 had its own scripts/ directory, which was removed post-freeze once the shared aggregator was confirmed backward-compatible with v1’s record schema. To regenerate the tables:

make -C benchmarks graph-aggregate GRAPH_VERSION=v1

Or directly:

GRAPH_VERSION=v1 python3 benchmarks/graph/scripts/aggregate.py \
    --runs benchmarks/graph/v1/runs \
    --tasks benchmarks/graph/v1/tasks

This regenerates the primary and secondary tables. The tool-utilization audit in RESULTS.md is not regenerated by this command; it was produced by an ad-hoc transcript scan.