Skip to content

Panic-Site Inventory — 2026-05

Mirrored from docs/design/auditability/panic_inventory_2026-05.md. Edit the source document in the repository, not this generated page.

Author: claude-opus-4-7

Baseline inventory for the panic-audit task ([T20260509-6]). Companion to scripts/audit_panics.py; regenerate the counts in this doc by running scripts/audit_panics.sh --json.


The task body cites ~1,864 panic-pattern call sites and frames them as a risk to durable execution. That headline number conflates four very different populations. This doc:

  1. Establishes a four-bucket count so the audit target has a meaningful denominator.
  2. Defines a five-label taxonomy for classifying each non-test panic site in execution-critical crates.
  3. Lists every blast-radius site (one row per site) with its classification, so subsequent phases of the task have a checklist.

This is intentionally a measurement doc, not a fix-list. Phase 2/3 of the plan in [T20260509-6] consume these rows.


Counts produced by scripts/audit_panics.sh against crates/*/src/**/*.rs on 2026-05-09. Definitions:

  • total — every match for \.unwrap\(\) | \.expect\( | panic!\( | unreachable!\( | unimplemented!\( | todo!\(.
  • in_cfg_test — same, but restricted to lines the heuristic identifies as living in a #[cfg(test)] scope, an inner #[test] / #[tokio::test] function body, or a path matching *_tests.rs, */tests/*, *test_support*, */fixtures/*.
  • non_testtotal − in_cfg_test.
  • blast_radius — non-test matches in the four execution-critical crates only: orbit-engine, orbit-agent, orbit-tools, orbit-exec.
Cratetotalin_cfg_testnon_testblast_radius
orbit-agent221399
orbit-cli48848620
orbit-common10590150
orbit-core66165920
orbit-engine2962732323
orbit-exec676700
orbit-knowledge6850180
orbit-mcp8800
orbit-policy131300
orbit-registry0000
orbit-store17717520
orbit-tools191544
TOTAL192418497536

The classifier is a heuristic — it uses path conventions plus a search for the last column-zero #[cfg(test)] mod <name> { … } and any #[test] / #[tokio::test] function body. It does not handle strings that span multiple lines, r#"…"# raw strings inside if-attributes, or an inner module declaration in a separate file referenced via mod foo;. The per-site rows in §4 were spot-checked against the source and remain authoritative if a number drifts.

The remaining non_test count outside the four execution-critical crates (39 sites across orbit-cli, orbit-common, orbit-core, orbit-knowledge, orbit-store) is intentionally left out of scope for this task; the task body explicitly deprioritizes them (“CLI entry points, test code, one-shot setup paths”).


Each non-test execution-critical site below is labelled with exactly one of the following:

  • lock-poison — the panic is Mutex::lock().expect(...) or Condvar::wait(...).expect(...). A poisoned lock means another thread already panicked while holding the guard, so the protected state is inconsistent. Returning Err(...) instead does not recover anything; it just changes the call shape. Phase 3 leaves these as-is and adds an invariant comment.

  • infallible-by-construction — the panic is justified because the immediately-preceding code constructs the value being unwrapped, so the failure mode is unreachable unless the macro/library contract changes. Example: serde_json::to_string(&value) on a Value::Object(...) we just built, or value.as_object_mut() immediately after json!({...}).

  • documented-invariant — the panic is sound but the soundness argument is not local; it depends on an upstream contract or an enforced state machine. These get a // Invariant: … line in Phase 3 rather than a typed-error rewrite.

  • accidental-should-return-result — the panic is an actual bug: the failure mode is genuinely reachable (parse failure, I/O failure, truly fallible serialization, …) and the enclosing function either already returns Result or could be changed to. Phase 2 converts these.

  • test-only — non-test classification flag was wrong; the site belongs in a test block. None of the rows in §4 land here, but the label is included for completeness because regenerating the inventory may surface false positives that need re-labelling rather than fixing.


One row per non-test site in orbit-engine, orbit-agent, orbit-tools, orbit-exec. The class column drives Phase 2 vs Phase 3 routing: accidental-should-return-result ⇒ Phase 2 fix; everything else ⇒ Phase 3 invariant comment.

File:lineSnippetClassNotes
loop_engine/audit/mod.rs:144self.events.lock().expect("audit mutex").clone()lock-poisonInMemorySink event vec; only test/observation surface, but lock semantics matter.
loop_engine/audit/mod.rs:154self.events.lock().expect("audit mutex").push(...)lock-poisonSame mutex; emit path.
loop_engine/audit/mod.rs:163.lock().expect("blob mutex")lock-poisonInMemorySink blob vec.
loop_engine/audit/mod.rs:218Ok(writer.as_mut().expect("writer initialized"))infallible-by-constructionensure_writer calls writer.replace(...) immediately above; .as_mut() is sound.
loop_engine/audit/mod.rs:231let mut writer = self.writer.lock().expect("audit writer");lock-poisonJsonlFileSink writer mutex.
loop_engine/replay_transport.rs:80let turns = self.turns.lock().expect("replay turns mutex");lock-poison
loop_engine/replay_transport.rs:81let mut cursor = self.cursor.lock().expect("replay cursor mutex");lock-poison
loop_engine/tool_dispatch.rs:33property.as_object_mut().expect("parameter schema")infallible-by-constructionschema_for_param_type returns json!({...}), always an Object.
loop_engine/tool_dispatch.rs:50input_schema.as_object_mut().expect("object")infallible-by-constructioninput_schema is built from json!({...}) two lines above.
File:lineSnippetClassNotes
activity_job/agent_loop_driver.rs:198cell.lock().expect("replay mutex poisoned")lock-poisonProcess-global REPLAY_TRANSPORT cache.
activity_job/agent_loop_driver.rs:211*cell.lock().expect("replay mutex poisoned") = None;lock-poisonSame cache; reset path.
activity_job/cli_runner/supervisor.rs:174buf.lock().expect("subprocess output buf poisoned")lock-poisonStdout/stderr accumulation buffer.
activity_job/groundhog.rs:845.lock().expect("groundhog attempt mutex poisoned")lock-poison
activity_job/groundhog.rs:853.lock().expect("groundhog attempt mutex poisoned")lock-poison
activity_job/groundhog.rs:859self.state.lock().expect("groundhog attempt mutex poisoned")lock-poison
activity_job/groundhog.rs:877self.state.lock().expect("groundhog attempt mutex poisoned")lock-poison
activity_job/job_executor/concurrency.rs:17self.state.lock().expect("sem poisoned")lock-poisonSemaphore mutex.
activity_job/job_executor/concurrency.rs:19self.cond.wait(guard).expect("sem poisoned")lock-poisonCondvar wait propagates poison.
activity_job/job_executor/concurrency.rs:28self.state.lock().expect("sem poisoned")lock-poison
activity_job/job_executor/exec_ctx.rs:27self.pipeline.lock().expect("pipeline poisoned").clone()lock-poison
activity_job/job_executor/exec_ctx.rs:79.lock().expect("pipeline poisoned")lock-poison
activity_job/job_executor/fan_out.rs:59ctx.pipeline.lock().expect("pipeline poisoned").clone()lock-poisonFan-out call site cited in task body.
activity_job/job_executor/fan_out.rs:69.lock().expect("results poisoned")lock-poison
activity_job/job_executor/fan_out.rs:117.lock().expect("results poisoned")lock-poison
activity_job/job_executor/fan_out.rs:130results.into_inner().expect("results poisoned")lock-poisonMutex::into_inner() poison.
activity_job/job_executor/mod.rs:150.lock().expect("pipeline poisoned")lock-poison
activity_job/job_executor/parallel.rs:46h.join().expect("branch thread panicked")documented-invariantPropagates branch-thread panic to parent on purpose; the alternative is silent loss.
activity_job/job_executor/target.rs:39ctx.sessions.lock().expect("sessions poisoned")lock-poison
activity_job/jsonl_sink.rs:56self.writer.lock().expect("v2 jsonl writer mutex")lock-poisonV2 audit writer; durability surface.
activity_job/tool_enforcement.rs:51self.tripped.lock().expect("tripped mutex").clone()lock-poison
activity_job/tool_enforcement.rs:77*self.tripped.lock().expect("tripped mutex") = ...lock-poison
activity_job/tool_enforcement.rs:101*self.tripped.lock().expect("tripped mutex") = ...lock-poison
File:lineSnippetClassNotes
builtin/orbit/duel/plan_winner.rs:41serde_json::to_string(&winner_payload).expect("winner payload serializes")accidental-should-return-resultwinner_payload contains user-supplied strings; serialization can theoretically fail (e.g. non-utf8 path strings). The enclosing function already returns Result<Value, OrbitError> — Phase 2 converts.
builtin/orbit/knowledge/write.rs:143Selector::Dir { .. } => unreachable!()documented-invariantA Selector::Dir is filtered out earlier in the same function before this match runs; needs a comment but remains an unreachable!.
graph.rs:240value.as_object_mut().expect("node context payload object")infallible-by-constructionvalue = json!({...}) two lines above.
graph.rs:245value.as_object_mut().expect("node context payload object")infallible-by-constructionSame value from line 231.

All non-test panic sites are confined to the test module (#[cfg(test)] mod tests { … }). No blast-radius rows.


ClassCountPhase
lock-poison28Phase 3 (invariant comment)
infallible-by-construction5Phase 3 (invariant comment, optional)
documented-invariant2Phase 3 (invariant comment)
accidental-should-return-result1Phase 2 (convert to ?)
test-only0
TOTAL36

lock-poison dominates by an order of magnitude. Per the plan’s risk section, converting these does not actually buy durability — a poisoned mutex means another thread already panicked while holding the guard, so the protected state is inconsistent regardless of whether we panic or return Err. Phase 3 leaves them as .expect(...) with an explicit invariant comment instead of pretending typed errors give us recovery we cannot actually provide.


This doc satisfies AC#1 of [T20260509-6] (“Inventory of all .unwrap/.expect/panic! sites, classified into categories”). Subsequent phases consume §4:

  • AC#2 (“execution-critical paths converted”) → Phase 2 changes the one accidental-should-return-result row.
  • AC#3 (“intentional panics carry rationale”) → Phase 3 adds invariant comments to the 35 non-accidental rows.
  • AC#5 (“net panic-site count materially reduced”) → measured against the 36-row blast-radius bucket, not the 1,924 headline. Target: blast_radius count after Phases 2-3 ≤ 10 uncommented sites. Every commented site carries an invariant rationale.

Terminal window
# Refreshed counts only:
scripts/audit_panics.sh
# Counts + JSON dump + blast-radius site list:
scripts/audit_panics.sh --json --sites > /tmp/audit_panics.json
# Single-line JSON for CI consumption:
scripts/audit_panics.sh --json | jq '.totals'

When the blast-radius count moves, refresh §2 and §4 in the same commit that introduced the change so the inventory does not drift.