
# Knowledge Graph Benchmarks

Mirrored from benchmarks/graph/README.md. Edit the source document in the repository, not this generated page.

This benchmark series measures how much navigation budget an agent spends on the same task under three tool surfaces: no-graph (shell only), graph-only (Orbit graph MCP tools only), and hybrid (both).

See docs/design/knowledge-graph/ for the graph itself.
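The three arms can be read as tool-surface allowlists. A minimal sketch, assuming placeholder tool names (`shell`, `graph_query`) that are not part of the harness — only the arm names come from the benchmark itself:

```python
# The three benchmark arms as tool-surface allowlists.
# Tool names are illustrative placeholders, not the harness's real tool IDs.
ARM_TOOLS = {
    "no-graph": {"shell"},                  # shell only
    "graph-only": {"graph_query"},          # Orbit graph MCP tools only
    "hybrid": {"shell", "graph_query"},     # both surfaces
}

def allowed(arm: str, tool: str) -> bool:
    """Return True if the given arm exposes the given tool."""
    return tool in ARM_TOOLS[arm]
```

A harness enforcing the arm would reject any tool call for which `allowed(arm, tool)` is false.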

Series closed. Read RESULTS.md for the cross-round synthesis and findings.

| Version | Scope | Report |
| --- | --- | --- |
| v1 | Initial baseline | RESULTS.md |
| v2 | Extended fixtures | RESULTS.md |
| v3 | Calibrated cost; published null result | RESULTS.md |
| v4 | Diagnostic round, 192 planned cells plus Codex post-fix graph-only rerun | RESULTS.md |
| v5 | Feature validation (source_regex), 9 cells | RESULTS.md |

All rounds frozen.

```sh
# Single cell
GRAPH_VERSION=v4 benchmarks/graph/scripts/run.sh graph-only reverse-export-orbit-error 1 --provider codex

# Sweep
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/sweep.py --provider codex --arms graph-only --n 3

# Aggregate
GRAPH_VERSION=v4 python3 benchmarks/graph/scripts/aggregate.py \
  --runs benchmarks/graph/v4/runs --tasks benchmarks/graph/v4/tasks
```

Run records and sweep manifests land at:

```
benchmarks/graph/<version>/runs/<provider>/<arm>/<task_id>/<seed>.json
benchmarks/graph/<version>/runs/_sweeps/<provider>/<sweep_id>/order.json
```
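The directory scheme above can be navigated programmatically. A minimal sketch, assuming only the layout shown — the helper names (`record_path`, `iter_records`) are illustrative, not part of the harness:

```python
from pathlib import Path

def record_path(root, version, provider, arm, task_id, seed):
    """Build <root>/<version>/runs/<provider>/<arm>/<task_id>/<seed>.json."""
    return Path(root) / version / "runs" / provider / arm / task_id / f"{seed}.json"

def iter_records(root, version):
    """Yield every per-cell record, skipping the _sweeps manifests."""
    runs = Path(root) / version / "runs"
    for p in sorted(runs.rglob("*.json")):
        if "_sweeps" not in p.parts:
            yield p
```

An aggregator would typically call `iter_records` and parse each file, ignoring the sweep `order.json` manifests.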

Records (`<seed>.json`) include the verdict, token counts, wall time, and a tool-call histogram. v1–v4 retain full transcripts (`<seed>.transcript.json`); v5 is records-only.
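Folding those records into a summary might look like the sketch below. The key name `verdict` is an assumption for illustration; the real record schema lives in each frozen version's harness:

```python
import json
from collections import Counter

def tally_verdicts(record_texts):
    """Count verdicts across raw <seed>.json payloads.

    "verdict" is a placeholder key; consult the frozen harness for the
    actual record schema.
    """
    counts = Counter()
    for text in record_texts:
        rec = json.loads(text)
        counts[rec.get("verdict", "unknown")] += 1
    return counts
```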

Version freezing rules and round structure: ../CONVENTIONS.md.

```
benchmarks/graph/
├── README.md        # this file (shared across versions)
├── v1/              # FROZEN round 1
│   ├── README.md    # version-specific banner
│   ├── METHOD.md
│   ├── RESULTS.md
│   ├── mcp.json
│   ├── scripts/     # harness as it ran
│   ├── tasks/       # fixtures as they were graded
│   └── runs/        # per-cell records
├── v2/              # FROZEN round 2
│   └── ...
├── v3/              # FROZEN round 3
│   └── ...
├── v4/              # FROZEN round 4
│   └── ...
└── v5/              # FROZEN round 5
    └── ...
```