Graph Benchmark Issues

Mirrored from benchmarks/graph/v1/ISSUES.md. Edit the source document in the repository, not this generated page.

This note records concrete token-usage issues observed in the preserved Codex benchmark transcripts for locate-agentruntime.

  • These are rough token estimates, not provider-reported billing numbers.
  • I estimated rough_tokens ~= output_characters / 4.
  • The benchmark artifacts do not currently expose per-step Codex token accounting, so the best available proxy is the size of each command’s captured output in the transcript.
  • Large outputs that appear early are especially expensive because they are likely replayed into later cached context.
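
The conversion above can be sketched in a few lines of Python. The helper name `rough_tokens` is borrowed from the formula in this note, and the round-half-up rule is an assumption chosen because it reproduces the figures reported here; the note itself does not state how values were rounded.

```python
def rough_tokens(output_chars: int) -> int:
    """Estimate tokens as characters / 4, rounded half up (assumed rounding rule)."""
    if output_chars < 0:
        raise ValueError("output_chars must be non-negative")
    # Adding 2 before integer division by 4 rounds halves upward.
    return (output_chars + 2) // 4


# Character counts taken from the issue table in this note.
overview_chars = 65_756
skill_read_chars = 5_563

print(rough_tokens(overview_chars))    # 16439
print(rough_tokens(skill_read_chars))  # 1391
```

Python's built-in `round` rounds halves to the nearest even integer, which would turn 16,914 characters into 4,228 rather than the 4,229 listed in the totals, so the sketch avoids it.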

The graph runs are spending budget on broad orientation dumps and duplicate verification, not on the direct answer path.

  • For this task, the cheapest useful graph path is:
orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'
orbit tool run orbit.graph.implementors --input '{"trait_selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait"}'
orbit tool run orbit.graph.show --input '{"selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait","depth":1,"siblings":false,"children":true}'
  • The most expensive graph steps were broader than necessary for that workflow.
| Issue | Example command | Transcript | Output chars | Rough tokens | Why it is expensive |
| --- | --- | --- | --- | --- | --- |
| Oversized graph overview | orbit tool run orbit.graph.overview --input '{"prefix":"crates/orbit-agent/src"}' | runs/codex/hybrid/locate-agentruntime/2.transcript.json:12 | 65,756 | ~16,439 | Dumps 47 files and 427 symbols. This is far larger than needed after the trait search already succeeded. |
| Full skill file loaded into run context | sed -n '1,220p' .orbit/resources/skills/orbit-graph/SKILL.md | runs/codex/graph-only/locate-agentruntime/1.transcript.json:5 | 5,563 | ~1,391 | Loads instructions into the conversation before any task-specific graph call. This is fixed overhead for graph-mode runs. |
| Noisy refs output around the trait | orbit tool run orbit.graph.refs --input '{"selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait","limit":50}' | runs/codex/graph-only/locate-agentruntime/1.transcript.json:21 | 6,183 | ~1,546 | Returns doc sections and README hits in addition to runtime implementors, so the model pays for irrelevant references. |
| Broad pack of all impl blocks | orbit tool run orbit.graph.pack --input '{"selectors":["symbol:crates/orbit-agent/src/providers/claude/claude_runtime.rs#ClaudeRuntime:impl","symbol:crates/orbit-agent/src/providers/codex/codex_runtime.rs#CodexRuntime:impl","symbol:crates/orbit-agent/src/providers/gemini/gemini_runtime.rs#GeminiRuntime:impl","symbol:crates/orbit-agent/src/providers/mock_agent/mock_agent_runtime.rs#MockAgentRuntime:impl","symbol:crates/orbit-agent/src/providers/ollama/ollama_runtime.rs#OllamaRuntime:impl"]}' | runs/codex/graph-only/locate-agentruntime/1.transcript.json:24 | 4,509 | ~1,127 | Helpful, but still a sizable blob of source that the model later restates almost directly. |
| Broad search that pulls in benchmark YAML noise | orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","limit":10}' | runs/codex/graph-only/locate-agentruntime/1.transcript.json:9 | 1,975 | ~494 | Returns benchmarks/graph/tasks/locate-agentruntime.yaml config keys before code symbols. |
| Duplicate raw file verification after graph already answered the question | Multiple sed -n and `nl -ba …sed -n` reads over runtime files | runs/codex/hybrid/locate-agentruntime/2.transcript.json:19-42 | 31,272 total | ~7,818 total | |
| Broad no-graph baseline search is also noisy | rg -n "AgentRuntime" crates . | runs/codex/no-graph/locate-agentruntime/1.transcript.json:7 | 12,909 | ~3,227 | Includes AGENTS.md, CLAUDE.md, benchmark YAML, and design docs. This is wasteful too, but it is still smaller than the giant graph overview dump. |

Hybrid rerun: the direct answer path was already available early

These two commands were small and sufficient:

orbit tool run orbit.graph.search --input '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'
orbit tool run orbit.graph.implementors --input '{"trait_selector":"symbol:crates/orbit-agent/src/runtime/runtime_trait.rs#AgentRuntime:trait"}'
  • The search result at runs/codex/hybrid/locate-agentruntime/2.transcript.json:11 already identifies AgentRuntime in crates/orbit-agent/src/runtime/runtime_trait.rs.
  • The implementor query at runs/codex/hybrid/locate-agentruntime/2.transcript.json:15 returns all five runtime implementors directly.
  • The very large overview at runs/codex/hybrid/locate-agentruntime/2.transcript.json:12 came between those steps and appears unnecessary for this task.
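
To put the overview's cost in proportion, the following sketch compares its captured size with the total graph tool output recorded for the hybrid run. Both character counts come from the tables in this note; the variable names are illustrative only.

```python
hybrid_graph_chars = 67_917  # hybrid/2 graph tool output, from the totals table
overview_chars = 65_756      # orbit.graph.overview output, from the issue table

# Graph tool output that would remain if the overview call were skipped.
remaining_chars = hybrid_graph_chars - overview_chars
overview_share = overview_chars / hybrid_graph_chars

print(remaining_chars)          # 2161
print(f"{overview_share:.1%}")  # 96.8%
```

Under the characters / 4 proxy, the remaining graph tool output in that run is on the order of 540 tokens.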

Graph-only: the expensive parts were mostly broad graph context

  • runs/codex/graph-only/locate-agentruntime/1.transcript.json:9 uses an unfocused search that surfaces benchmark task YAML.
  • runs/codex/graph-only/locate-agentruntime/1.transcript.json:21 uses orbit.graph.refs, which returns many non-implementor references.
  • runs/codex/graph-only/locate-agentruntime/1.transcript.json:24 uses orbit.graph.pack to pull the full impl bodies for all five runtimes.

Together, those three graph-tool outputs account for about 12,667 characters, or roughly 3,167 tokens, before counting the skill read overhead.
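
The sum can be checked directly from the per-command character counts in the issue table. This is a small arithmetic sketch; the dictionary labels are descriptive only, and the round-half-up step is an assumption that matches the reported figure.

```python
# Captured output sizes for the three graph-only calls, in characters.
graph_only_outputs = {
    "orbit.graph.search (unfiltered)": 1_975,
    "orbit.graph.refs": 6_183,
    "orbit.graph.pack": 4_509,
}

total_chars = sum(graph_only_outputs.values())
total_tokens = (total_chars + 2) // 4  # chars / 4, rounded half up (assumed rounding)

print(total_chars, total_tokens)  # 12667 3167
```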

The following totals sum the captured command output size for each run, grouped by category.

| Run | Dominant category | Captured chars | Rough tokens |
| --- | --- | --- | --- |
| graph-only/1 | Graph tool output | 16,914 | ~4,229 |
| graph-only/1 | Skill file read | 5,563 | ~1,391 |
| no-graph/1 | Source reads | 20,362 | ~5,091 |
| no-graph/1 | Ripgrep output | 13,590 | ~3,398 |
| hybrid/2 | Graph tool output | 67,917 | ~16,979 |
| hybrid/2 | Source reads | 31,272 | ~7,818 |
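
For a per-run comparison, the category rows can be summed as in the sketch below. It covers only the categories listed in the table, so each result is a lower bound on a run's captured output rather than a complete total. The rounding rule is again an assumption.

```python
from collections import defaultdict

# (run, category, captured chars) rows copied from the totals table.
rows = [
    ("graph-only/1", "Graph tool output", 16_914),
    ("graph-only/1", "Skill file read", 5_563),
    ("no-graph/1", "Source reads", 20_362),
    ("no-graph/1", "Ripgrep output", 13_590),
    ("hybrid/2", "Graph tool output", 67_917),
    ("hybrid/2", "Source reads", 31_272),
]

totals = defaultdict(int)
for run, _category, chars in rows:
    totals[run] += chars

# Print runs from cheapest to most expensive.
for run, chars in sorted(totals.items(), key=lambda item: item[1]):
    tokens = (chars + 2) // 4  # chars / 4, rounded half up (assumed rounding)
    print(f"{run}: {chars:,} chars, ~{tokens:,} tokens")

# graph-only/1: 22,477 chars, ~5,619 tokens
# no-graph/1: 33,952 chars, ~8,488 tokens
# hybrid/2: 99,189 chars, ~24,797 tokens
```
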
  • For narrow symbol-location tasks, do not call orbit.graph.overview after orbit.graph.search already found the target symbol.
  • Prefer search -> implementors -> show over overview -> refs -> pack.
  • Avoid broad orbit.graph.search calls without type, kind, or prefix filters.
  • Tighten orbit.graph.refs usage or post-filter its results so docs and benchmark YAML do not dominate the output.
  • If graph tools already provide the trait and implementor list, do not reread every provider source file unless the benchmark explicitly requires code-level behavioral summaries.
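
One way to act on the search recommendation is a small pre-flight check that flags unfiltered search inputs before they run. The sketch below is hypothetical: `is_broad_search` and `FILTER_KEYS` are names invented for illustration, not part of orbit. It also takes the recommendation at its word that `type`, `kind`, and `prefix` are the relevant filters; only `type` and `kind` appear in the search commands shown in this note.

```python
import json

# Filter names taken from the recommendation above (assumed to apply to orbit.graph.search).
FILTER_KEYS = ("type", "kind", "prefix")


def is_broad_search(input_json: str) -> bool:
    """Return True when a search input carries none of the narrowing filters."""
    params = json.loads(input_json)
    return not any(key in params for key in FILTER_KEYS)


# The two search inputs that appear in the transcripts.
broad = '{"query":"AgentRuntime","limit":10}'
narrow = '{"query":"AgentRuntime","type":"symbol","kind":"trait","limit":10}'

print(is_broad_search(broad))   # True
print(is_broad_search(narrow))  # False
```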