# Graph Benchmark Issues

Mirrored from `benchmarks/graph/v1/ISSUES.md`. Edit the source document in the repository, not this generated page.
This note records concrete token-usage issues observed in the preserved Codex benchmark transcripts for the `locate-agentruntime` task.
## Method

- These are rough token estimates, not provider-reported billing numbers.
- I estimated `rough_tokens ~= output_characters / 4`.
- The benchmark artifacts do not currently expose per-step Codex token accounting, so the best available proxy is the size of each command’s captured output in the transcript.
- Large outputs that appear early are especially expensive because they are likely replayed into later cached context.
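The estimate can be sketched as a small helper. This is an illustrative reconstruction, not the script that produced the figures in this note, and `rough_tokens` is a hypothetical name. Integer arithmetic with `+ 2` rounds halves up, which matches the reported values (for example, 16,914 / 4 = 4,228.5 is listed as ~4,229).

```python
def rough_tokens(output_chars: int) -> int:
    """Approximate token count as characters / 4, rounding halves up.

    A proxy only: real tokenizers vary with content, so treat the result
    as an order-of-magnitude estimate, not a billing number.
    """
    if output_chars < 0:
        raise ValueError("character count cannot be negative")
    # (chars + 2) // 4 equals floor(chars / 4 + 0.5), i.e. round-half-up.
    # Python's built-in round() would send 4,228.5 to the even 4,228 instead.
    return (output_chars + 2) // 4


if __name__ == "__main__":
    for chars in (65_756, 5_563, 16_914, 20_362):
        print(f"{chars:>7,} chars -> ~{rough_tokens(chars):,} tokens")
```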
## Main Finding

The graph runs are spending budget on broad orientation dumps and duplicate verification, not on the direct answer path.

- For this task, the cheapest useful graph path is:
  1. `orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'`
  2. `orbit tool run orbit.graph.implementors --input '{"trait_selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait"}'`
  3. `orbit tool run orbit.graph.show --input '{"selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait","depth":1,"siblings":false,"children":true}'`
- The most expensive graph steps were broader than necessary for that workflow.
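The three calls in the cheapest path share one trait selector. A minimal sketch that builds their `--input` payloads, assuming the selector grammar `symbol:<path>#<name>:<kind>` inferred from the examples in this note rather than from Orbit documentation; `symbol_selector` and `answer_path_inputs` are hypothetical helper names.

```python
import json


def symbol_selector(path: str, name: str, kind: str) -> str:
    # Assumed grammar, inferred from the selectors quoted in this note.
    return f"symbol:{path}#{name}:{kind}"


def answer_path_inputs(path: str, name: str) -> list[tuple[str, str]]:
    """Return (tool, compact JSON input) pairs for search -> implementors -> show."""
    selector = symbol_selector(path, name, "trait")
    steps = [
        ("orbit.graph.search",
         {"query": name, "type": "symbol", "kind": "trait", "limit": 10}),
        ("orbit.graph.implementors", {"trait_selector": selector}),
        ("orbit.graph.show",
         {"selector": selector, "depth": 1, "siblings": False, "children": True}),
    ]
    # Compact separators reproduce the exact JSON style used in the commands above.
    return [(tool, json.dumps(payload, separators=(",", ":"))) for tool, payload in steps]


if __name__ == "__main__":
    for tool, payload in answer_path_inputs(
        "crates/orbit-agent/src/runtime/runtime_trait.rs", "AgentRuntime"
    ):
        print(f"orbit tool run {tool} --input '{payload}'")
```

Building all three inputs from one selector keeps the steps consistent, so a typo in the trait path cannot silently make `implementors` and `show` disagree.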
## Concrete Issues

| Issue | Example command | Transcript | Output chars | Rough tokens | Why it is expensive |
|---|---|---|---|---|---|
| Oversized graph overview | `orbit tool run orbit.graph.overview --input '{"prefix":"crates/orbit-agent/src"}'` | `runs/codex/hybrid/locate-agentruntime/2.transcript.json:12` | 65,756 | ~16,439 | Dumps 47 files and 427 symbols. This is far larger than needed after the trait search already succeeded. |
| Full skill file loaded into run context | `sed -n '1,220p' .orbit/resources/skills/orbit-graph/SKILL.md` | `runs/codex/graph-only/locate-agentruntime/1.transcript.json:5` | 5,563 | ~1,391 | Loads instructions into the conversation before any task-specific graph call. This is fixed overhead for graph-mode runs. |
| Noisy refs output around the trait | `orbit tool run orbit.graph.refs --input '{"selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait","limit":50}'` | `runs/codex/graph-only/locate-agentruntime/1.transcript.json:21` | 6,183 | ~1,546 | Returns doc sections and README hits in addition to runtime implementors, so the model pays for irrelevant references. |
| Broad pack of all impl blocks | `orbit tool run orbit.graph.pack --input '{"selectors":["symbol:crates/orbit-agent/src/providers/claude/claude_runtime.rs#ClaudeRuntime:impl","symbol:crates/orbit-agent/src/providers/codex/codex_runtime.rs#CodexRuntime:impl","symbol:crates/orbit-agent/src/providers/gemini/gemini_runtime.rs#GeminiRuntime:impl","symbol:crates/orbit-agent/src/providers/mock_agent/mock_agent_runtime.rs#MockAgentRuntime:impl","symbol:crates/orbit-agent/src/providers/ollama/ollama_runtime.rs#OllamaRuntime:impl"]}'` | `runs/codex/graph-only/locate-agentruntime/1.transcript.json:24` | 4,509 | ~1,127 | Helpful, but still a sizable blob of source that the model later restates almost directly. |
| Broad search that pulls in benchmark YAML noise | `orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","limit":10}'` | `runs/codex/graph-only/locate-agentruntime/1.transcript.json:9` | 1,975 | ~494 | Returns `benchmarks/graph/tasks/locate-agentruntime.yaml` config keys before code symbols. |
| Duplicate raw file verification after graph already answered the question | Multiple `sed -n` and `nl -ba … \| sed -n` reads over runtime files | `runs/codex/hybrid/locate-agentruntime/2.transcript.json:19-42` | 31,272 total | ~7,818 total | Rereads runtime source files that the graph tools had already located. |
| Broad no-graph baseline search is also noisy | `rg -n "AgentRuntime" crates .` | `runs/codex/no-graph/locate-agentruntime/1.transcript.json:7` | 12,909 | ~3,227 | Includes `AGENTS.md`, `CLAUDE.md`, benchmark YAML, and design docs. This is wasteful too, but it is still smaller than the giant graph overview dump. |
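The refs row and the broad-search row share one cause: non-code paths (docs, README files, benchmark YAML) crowd out code hits. A client-side post-filter could be sketched as follows. It assumes each hit is a dict with a `path` key; that result shape, the `keep_code_hits` name, and the path rules are hypothetical, not the documented output of `orbit.graph.refs`.

```python
from pathlib import PurePosixPath

# Hypothetical policy: keep Rust sources under crates/, drop everything else.
CODE_ROOT = "crates/"
CODE_SUFFIXES = {".rs"}


def keep_code_hits(hits: list[dict]) -> list[dict]:
    """Drop hits whose path is not a source file under the code root."""
    kept = []
    for hit in hits:
        path = hit.get("path", "")
        if path.startswith(CODE_ROOT) and PurePosixPath(path).suffix in CODE_SUFFIXES:
            kept.append(hit)
    return kept


if __name__ == "__main__":
    sample = [
        {"path": "benchmarks/graph/tasks/locate-agentruntime.yaml"},
        {"path": "README.md"},
        {"path": "crates/orbit-agent/src/providers/codex/codex_runtime.rs"},
    ]
    print(keep_code_hits(sample))
```

A path-based filter like this would also drop legitimate documentation hits, so it suits code-location tasks such as this one rather than general reference lookups.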
## Transcript Notes

### Hybrid rerun: the direct answer path was already available early

These two commands were small and sufficient:
1. `orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'`
2. `orbit tool run orbit.graph.implementors --input '{"trait_selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait"}'`

- The search result at `runs/codex/hybrid/locate-agentruntime/2.transcript.json:11` already identifies `AgentRuntime` in `crates/orbit-agent/src/runtime/runtime_trait.rs`.
- The implementor query at `runs/codex/hybrid/locate-agentruntime/2.transcript.json:15` returns all five runtime implementors directly.
- The very large overview at `runs/codex/hybrid/locate-agentruntime/2.transcript.json:12` came between those steps and appears unnecessary for this task.
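The pattern in this rerun, an overview issued after the search had already returned the target trait, can be detected mechanically. A minimal sketch over a simplified call log; the `Call` record and its `found_target` flag are hypothetical and are not the transcript JSON schema.

```python
from dataclasses import dataclass


@dataclass
class Call:
    tool: str           # e.g. "orbit.graph.search"
    found_target: bool  # hypothetical: did this call surface the target symbol?


def redundant_overviews(calls: list[Call]) -> list[int]:
    """Return indexes of overview calls made after the target was already found."""
    flagged = []
    target_known = False
    for i, call in enumerate(calls):
        # Check before updating, so an overview is judged only on earlier calls.
        if call.tool == "orbit.graph.overview" and target_known:
            flagged.append(i)
        if call.found_target:
            target_known = True
    return flagged


if __name__ == "__main__":
    # Shape of the hybrid rerun: search (hit) -> overview -> implementors.
    log = [
        Call("orbit.graph.search", True),
        Call("orbit.graph.overview", False),
        Call("orbit.graph.implementors", True),
    ]
    print(redundant_overviews(log))  # [1]
```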
### Graph-only: the expensive parts were mostly broad graph context

- `runs/codex/graph-only/locate-agentruntime/1.transcript.json:9` uses an unfocused search that surfaces benchmark task YAML.
- `runs/codex/graph-only/locate-agentruntime/1.transcript.json:21` uses `orbit.graph.refs`, which returns many non-implementor references.
- `runs/codex/graph-only/locate-agentruntime/1.transcript.json:24` uses `orbit.graph.pack` to pull the full impl bodies for all five runtimes.
Together, those three graph-tool outputs account for about 12,667 characters, or roughly 3,167 tokens, before counting the skill read overhead.
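The combined figure follows from the search (`:9`), refs (`:21`), and pack (`:24`) rows of the Concrete Issues table, using the same characters / 4 proxy:

$$
1{,}975 + 6{,}183 + 4{,}509 = 12{,}667 \ \text{chars}, \qquad \frac{12{,}667}{4} = 3{,}166.75 \approx 3{,}167 \ \text{tokens}
$$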
## Rough Per-Run Breakdown

These totals sum captured command output size by category.
| Run | Dominant category | Captured chars | Rough tokens |
|---|---|---|---|
| graph-only/1 | Graph tool output | 16,914 | ~4,229 |
| graph-only/1 | Skill file read | 5,563 | ~1,391 |
| no-graph/1 | Source reads | 20,362 | ~5,091 |
| no-graph/1 | Ripgrep output | 13,590 | ~3,398 |
| hybrid/2 | Graph tool output | 67,917 | ~16,979 |
| hybrid/2 | Source reads | 31,272 | ~7,818 |
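Summing the listed categories per run makes the comparison explicit. This sketch uses only the figures in the breakdown table, so each total covers the listed categories, not necessarily every command in the run; `per_run_totals` is a hypothetical helper name.

```python
from collections import defaultdict

# (run, category, captured chars) copied from the breakdown table.
ROWS = [
    ("graph-only/1", "Graph tool output", 16_914),
    ("graph-only/1", "Skill file read", 5_563),
    ("no-graph/1", "Source reads", 20_362),
    ("no-graph/1", "Ripgrep output", 13_590),
    ("hybrid/2", "Graph tool output", 67_917),
    ("hybrid/2", "Source reads", 31_272),
]


def per_run_totals(rows):
    """Sum captured characters per run across the listed categories."""
    totals = defaultdict(int)
    for run, _category, chars in rows:
        totals[run] += chars
    return dict(totals)


if __name__ == "__main__":
    # Largest run first.
    for run, chars in sorted(per_run_totals(ROWS).items(), key=lambda kv: -kv[1]):
        # Same chars / 4 proxy as the Method section, rounding halves up.
        print(f"{run:<13} {chars:>7,} chars  ~{(chars + 2) // 4:,} tokens")
```

With these figures, hybrid/2 totals 99,189 characters (~24,797 tokens) across its two listed categories, roughly three times no-graph/1 (33,952, ~8,488) and more than four times graph-only/1 (22,477, ~5,619).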
## Recommendations

- For narrow symbol-location tasks, do not call `orbit.graph.overview` after `orbit.graph.search` has already found the target symbol.
- Prefer `search -> implementors -> show` over `overview -> refs -> pack`.
- Avoid broad `orbit.graph.search` calls without `type`, `kind`, or `prefix` filters.
- Tighten `orbit.graph.refs` usage or post-filter its results so docs and benchmark YAML do not dominate the output.
- If graph tools already provide the trait and implementor list, do not reread every provider source file unless the benchmark explicitly requires code-level behavioral summaries.
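The filter recommendation can be checked before a search is issued. A minimal sketch, assuming the `--input` JSON keys named in this note (`type`, `kind`, `prefix`); `is_focused_search` is a hypothetical helper, and whether `orbit.graph.search` accepts `prefix` is taken from the recommendation rather than verified against Orbit's schema.

```python
import json

# Narrowing filters named in the recommendation (assumed key names).
FOCUS_KEYS = ("type", "kind", "prefix")


def is_focused_search(input_json: str) -> bool:
    """True if an orbit.graph.search input sets at least one non-empty narrowing filter."""
    payload = json.loads(input_json)
    return any(payload.get(key) for key in FOCUS_KEYS)


if __name__ == "__main__":
    broad = '{"query":"AgentRuntime","limit":10}'
    narrow = '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'
    print(is_focused_search(broad), is_focused_search(narrow))  # False True
```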