Recovery
Every governed run can encounter blocked or failed states. AgentXchain ensures that every such state has an explicit recovery path — no state is terminal without an operator escape route.
How Recovery Works
When a governed run enters a blocked state, the orchestrator persists a recovery descriptor containing:
- typed_reason: what category of block occurred
- owner: who is responsible for resolution (always human for operator-facing blocks)
- recovery_action: the exact command or action to take
- turn_retained: whether the blocked turn is still assigned and can be resumed
- runtime_guidance: optional runtime-aware blocker explanations for gate failures caused by config/runtime ownership limits
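As an illustration only, the descriptor fields listed above can be modelled as a small record. The field names come from the list; the Python types, the list shape of runtime_guidance, and the render_status_line helper are assumptions made for this sketch, not the orchestrator's actual persistence format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryDescriptor:
    typed_reason: str        # category of block, e.g. "dispatch_error"
    owner: str               # "human" for operator-facing blocks
    recovery_action: str     # exact command or action to take
    turn_retained: bool      # True if the blocked turn can still be resumed
    runtime_guidance: Optional[list] = None  # optional; list shape is assumed


def render_status_line(d: RecoveryDescriptor) -> str:
    """Hypothetical helper: summarize a blocked run in one line."""
    retained = "retained" if d.turn_retained else "not retained"
    return f"blocked ({d.typed_reason}, turn {retained}): run `{d.recovery_action}`"


example = RecoveryDescriptor(
    typed_reason="dispatch_error",
    owner="human",
    recovery_action="agentxchain step --resume",
    turn_retained=True,
)
print(render_status_line(example))
```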
The agentxchain status command always displays the current recovery action when a run is blocked.
When the blocker truly requires human action, AgentXchain also writes a structured escalation record to .agentxchain/human-escalations.jsonl and mirrors the current open item into HUMAN_TASKS.md. status surfaces the linked escalation id together with agentxchain unblock <id>.
When the latest gate failure exposes a deeper runtime/config blocker, status also renders runtime guidance. These codes are:
| Code | Meaning | Operator action |
|---|---|---|
| invalid_binding | The owning role/runtime binding is structurally invalid for file ownership | Fix agentxchain.json, then agentxchain validate |
| review_only_remote_dead_end | A remote review-only role owns a required artifact but can only return review artifacts | Fix agentxchain.json, then agentxchain validate |
| proposal_apply_required | Required files exist only in the staged proposal, not the workspace yet | agentxchain proposal apply <turn_id> |
| tool_defined_proof_not_strong_enough | MCP/tool-defined ownership cannot be proven statically | agentxchain role show <role_id> and inspect the tool contract |
Recovery Map
Approval Gates
| State | Cause | Recovery |
|---|---|---|
| pending_phase_transition | Phase gate requires human approval | agentxchain approve-transition |
| pending_run_completion | Run completion gate requires human approval | agentxchain approve-completion |
These are governance-mandated pauses, not failures. The run resumes automatically after approval unless a configured gate action fails. When that happens, the run blocks with typed_reason: gate_action_failed and the same approval command remains the recovery action. See Gate Actions.
Dispatch Failures
| State | Cause | Recovery |
|---|---|---|
| dispatch_error | Adapter dispatch failed (API error, MCP transport failure, local CLI crash) | Fix the issue, then agentxchain step --resume |
The failed turn is retained. step --resume re-dispatches the same turn to the same role without re-assignment.
Escalations
| State | Cause | Recovery |
|---|---|---|
| operator_escalation | Operator raised via agentxchain escalate | Resolve the issue, then agentxchain resume for run-level or retained manual turns; use agentxchain step --resume for retained non-manual turns |
| retries_exhausted | Max retries hit on a role turn | agentxchain resume for retained manual turns, otherwise agentxchain step --resume |
Operator escalations record decision = "operator_escalated" in the decision ledger. Resolution records decision = "escalation_resolved". If the escalation targeted one retained turn out of several, the surfaced recovery command includes --turn <id> so the operator can run it directly.
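The choice between resume and step --resume in the table above depends on whether a turn is retained and whether it is a manual turn. A minimal sketch of that selection rule follows; the function name and its parameters are hypothetical, and only the command strings come from this page.

```python
from typing import Optional


def escalation_recovery_command(
    turn_retained: bool,
    turn_is_manual: bool,
    turn_id: Optional[str] = None,
) -> str:
    """Pick the recovery command for an escalation block (illustrative only).

    turn_id is passed only when the escalation targeted one retained turn
    out of several, in which case the surfaced command carries --turn <id>.
    """
    if turn_retained and not turn_is_manual:
        command = "agentxchain step --resume"   # retained non-manual turn
    else:
        command = "agentxchain resume"          # run-level or retained manual turn
    if turn_id is not None:
        command += f" --turn {turn_id}"
    return command


print(escalation_recovery_command(turn_retained=True, turn_is_manual=False))
```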
Agent Requests
| State | Cause | Recovery |
|---|---|---|
| needs_human | Agent returned blocked_on: "human:..." requesting human intervention | Resolve the stated issue, then agentxchain unblock <id> |
The normal needs_human path accepts the turn that raised the issue and clears it, so recovery assigns the next turn after unblock instead of trying to wait on a non-existent retained turn.
If the escalation is tied to a phase-exit gate, unblock <id> also reconciles the gate before dispatch. A satisfied standing gate can advance the phase even when there is no pending_phase_transition object; AgentXchain marks the gate passed, clears stale active turns and reservations from the exited phase, emits phase_cleanup, then dispatches the next phase's entry role.
When a human escalation is raised, three notification surfaces fire automatically:
- events.jsonl: a human_escalation_raised event is appended with full escalation metadata (type, service, action, resolution command).
- stderr: a structured local notice prints the escalation ID, type, action, and unblock command. This fires regardless of webhook configuration.
- webhooks: if configured, a human_escalation_raised notification is delivered to all subscribed webhooks.
When the escalation is resolved via agentxchain unblock <id>, the corresponding human_escalation_resolved event is emitted to all three surfaces.
Set AGENTXCHAIN_LOCAL_NOTIFY=1 on macOS to also receive native desktop notifications.
Hook Failures
| State | Cause | Recovery |
|---|---|---|
| hook_block | Lifecycle hook failed without tampering | Follow the surfaced command: approve-*, accept-turn, runtime-aware retained-turn recovery, or resume --role depending on where the hook fired |
| hook_tamper | Hook detected unauthorized file changes | Fix tampering, then follow the surfaced command (agentxchain resume for cleared or retained manual turns, agentxchain step --resume for retained non-manual turns, approval commands for gate pauses) |
Hook blocks during assignment require resume --role to re-assign. Hook tamper no longer assumes every blocked turn is resumable with step --resume; the persisted recovery action is derived from the actual retained-turn state.
Turn Conflicts
When a turn's files_changed overlap with files accepted by another turn since the conflicting turn was assigned, the governed-state layer detects a conflict. The first two detections mark the turn as conflicted but the run stays active. On the third consecutive detection (detection count ≥ 3), the run transitions to blocked with typed_reason: conflict_loop.
agentxchain status shows both resolution options, the number of conflicting files, the detection count, and the overlap percentage. The suggested resolution is based on file overlap:
- Overlap < 50% → reject_and_reassign (re-dispatch with conflict context; faster automation recovery)
- Overlap ≥ 50% → human_merge (too many overlapping files for clean re-dispatch; operator explicitly accepts the current staged result as the authoritative merge in one command)
The suggestion is guidance, not enforcement — operators may choose either path regardless of overlap.
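The two thresholds described above (a 50% overlap cut-off for the suggestion and a block on the third consecutive detection) can be sketched as follows. The function names are hypothetical, and representing overlap as a fraction between 0 and 1 is an assumption; how AgentXchain computes the overlap percentage is not specified here.

```python
BLOCK_AFTER_DETECTIONS = 3     # third consecutive detection blocks the run
HUMAN_MERGE_OVERLAP = 0.5      # overlap at or above 50% suggests human_merge


def suggested_resolution(overlap: float) -> str:
    """Return the suggested (not enforced) resolution for a conflicted turn."""
    if overlap >= HUMAN_MERGE_OVERLAP:
        return "human_merge"
    return "reject_and_reassign"


def run_status_after_detection(detection_count: int) -> str:
    """The run stays active for the first two detections, then blocks."""
    if detection_count >= BLOCK_AFTER_DETECTIONS:
        return "blocked:conflict_loop"
    return "active"


print(suggested_resolution(0.25), run_status_after_detection(2))
```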
| State | Cause | Recovery |
|---|---|---|
| conflict_loop | 3+ consecutive conflict detections on the same turn | Default surfaced action: agentxchain reject-turn --turn <id> --reassign; alternate single-step merge path: agentxchain accept-turn --turn <id> --resolution human_merge |
| Condition | Recovery |
|---|---|
| Conflicted turn (overlapping agent changes) | agentxchain reject-turn --reassign or agentxchain accept-turn --resolution human_merge (one-step acceptance of the staged merge result) |
| Validation failure (retryable) | agentxchain reject-turn then agentxchain step |
| Validation failure (non-retryable) | Manual fix, then agentxchain accept-turn |
Policy Escalations
| State | Cause | Recovery |
|---|---|---|
| policy_escalation | A declarative policy with action: "escalate" fired (e.g., turn cap reached, role monopoly detected) | Resolve the policy condition (e.g., increase the limit in agentxchain.json, change routing), then use agentxchain resume for retained manual turns or cleared runs; use agentxchain step --resume for retained non-manual turns |
Policy escalations are distinct from operator escalations. They are triggered automatically by the declarative policy engine based on config rules, not by an operator command. The agentxchain status output shows which policy fired and what condition was violated. The decision-ledger.jsonl records decision = "policy_escalation" with the violating policy details.
Timeout
| State | Cause | Recovery |
|---|---|---|
| timeout | A turn, phase, or run exceeded its configured time limit (timeouts in agentxchain.json) | Check whether the timeout is caused by a stuck agent or legitimately long work. If the limit was too tight, raise it with agentxchain config --set timeouts.per_turn_minutes <min> (or per_phase_minutes / per_run_minutes), then agentxchain resume |
Timeouts are enforced when accepted work crosses the turn-acceptance boundary, and agentxchain status shows read-only timeout pressure or persisted timeout blocks. They do not currently re-fire on approve-transition or approve-completion. Per-phase timeouts can be overridden in routing with timeout_minutes and timeout_action per phase.
Approval waits are exempt from timeout mutation, but they do not stop the phase/run clock. If a run sits in pending_phase_transition or pending_run_completion for days, the next accepted turn can immediately block on timeout:phase or timeout:run. That is why status surfaces timeout pressure during the approval pause instead of leaving it undiscovered until the next acceptance boundary.
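To make the clock behaviour concrete, here is a minimal sketch of a phase-timeout check at the acceptance boundary. It assumes the elapsed time is measured from phase start to the moment a turn is accepted, with approval waits included; the function name and signature are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def phase_timeout_on_acceptance(
    phase_started: datetime,
    accepted_at: datetime,
    per_phase_minutes: int,
) -> Optional[str]:
    """Return "timeout:phase" if the phase clock has run out, else None.

    Time spent waiting on an approval gate is not subtracted, so a long
    approval pause counts against the limit.
    """
    elapsed = accepted_at - phase_started
    if elapsed > timedelta(minutes=per_phase_minutes):
        return "timeout:phase"
    return None


start = datetime(2024, 1, 1, 9, 0)
# 30 minutes of work, then the run waits two days for approval.
after_long_approval = start + timedelta(days=2, minutes=30)
print(phase_timeout_on_acceptance(start, after_long_approval, per_phase_minutes=60))
```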
For the full config and report/status surface, see Timeouts.
Budget Exhaustion
| State | Cause | Recovery |
|---|---|---|
| budget_exhausted | Cumulative turn costs exceeded per_run_max_usd | Increase budget with agentxchain config --set budget.per_run_max_usd <usd>, then agentxchain resume |
The turn that exhausts the budget IS accepted — its work is preserved. Only subsequent turns are blocked until the operator raises the budget limit. Use agentxchain config --set budget.per_run_max_usd <usd> for that simple recovery instead of hand-editing JSON. Because there is no retained turn, budget recovery re-assigns the next turn with agentxchain resume instead of replaying work with step --resume.
To prevent hard stops on budget exhaustion, switch to warn mode with agentxchain config --set budget.on_exceed warn. In warn mode the run continues past budget with observable warnings in status, events, and notifications instead of blocking.
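The budget rules above (the exhausting turn is still accepted, later turns are blocked, warn mode keeps the run going) can be sketched as a single acceptance step. The function, its return shape, and the warning text are assumptions for illustration; only per_run_max_usd, budget_exhausted, and the warn behaviour come from this page.

```python
from typing import Optional, Tuple


def accept_turn_cost(
    spent_usd: float,
    turn_cost_usd: float,
    per_run_max_usd: float,
    warn_mode: bool = False,
) -> Tuple[float, str, Optional[str]]:
    """Return (new cumulative spend, run status, optional warning)."""
    new_spent = spent_usd + turn_cost_usd   # the turn's work is always preserved
    if new_spent <= per_run_max_usd:
        return new_spent, "active", None
    if warn_mode:
        # budget.on_exceed = warn: continue, but surface a warning
        return new_spent, "active", "budget exceeded"
    return new_spent, "blocked:budget_exhausted", None


print(accept_turn_cost(8.0, 3.0, 10.0))
print(accept_turn_cost(8.0, 3.0, 10.0, warn_mode=True))
```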
Session-level budget (continuous mode): In addition to per-run budget caps, continuous mode supports a cumulative session-level budget via --session-budget <usd> or continuous.per_session_max_usd in config. When total spend across all runs in the session reaches this limit, the continuous loop exits cleanly with session_budget_exhausted. This is a terminal stop, not a blocker — start a new session to continue. Use agentxchain status to see cumulative spend and budget utilization for the active session.
Continuous failure handling: If a governed run reaches blocked state during continuous mode, the session pauses and preserves the blocked recovery surface (run_blocked_reason, run_blocked_recovery) instead of pretending the run completed. Operators must follow that exact surfaced recovery action, not assume every blocked session resolves with agentxchain unblock <id>. Human escalations still use unblock, but retained ghost/stale turns use the surfaced agentxchain reissue-turn --reason ghost|stale path. If a governed run fails without reaching blocked state, the continuous session stops with status: "failed" and leaves the current intake intent unresolved for operator investigation.
Continuous ghost auto-recovery: When run_loop.continuous.auto_retry_on_ghost.enabled is true, continuous mode automatically reissues retained startup ghosts (failed_start with runtime_spawn_failed or stdout_attach_failed) through the same governed reissueTurn() path. Attempts are capped by max_retries_per_run, recorded in continuous-session.json, and emitted as auto_retried_ghost events. When the cap is exhausted, the session pauses, emits ghost_retry_exhausted, and mirrors Auto-retry exhausted after N/N attempts into blocked_reason.recovery.detail while keeping the manual reissue-turn --reason ghost command visible. Non-continuous runs and opt-out continuous sessions keep the manual recovery posture.
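A rough sketch of the capped retry decision described above: the event names, the max_retries_per_run cap, and the detail message come from this page, while the function names and the "manual_reissue" label for the opt-out case are hypothetical.

```python
def next_ghost_action(enabled: bool, attempts_so_far: int, max_retries_per_run: int) -> str:
    """Decide what happens when a retained startup ghost is found."""
    if not enabled:
        return "manual_reissue"            # operator runs reissue-turn --reason ghost
    if attempts_so_far < max_retries_per_run:
        return "auto_retried_ghost"        # reissue automatically, record the attempt
    return "ghost_retry_exhausted"         # pause the session, keep manual command visible


def exhausted_detail(attempts: int, cap: int) -> str:
    """Detail text mirrored into blocked_reason.recovery.detail."""
    return f"Auto-retry exhausted after {attempts}/{cap} attempts"


print(next_ghost_action(True, 0, 2), "|", exhausted_detail(2, 2))
```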
Paused-session guard: When a continuous session is paused (awaiting the surfaced recovery action), the daemon keeps polling but does not attempt to start new work. On each poll cycle, advanceContinuousRunOnce() checks the governed project state: if still blocked, it returns still_blocked and stays paused; if the block has been resolved (via the stored recovery_action, such as agentxchain unblock <id> for needs_human or agentxchain reissue-turn --reason ghost|stale for retained ghost/stale turns), the session resumes automatically by continuing the existing governed run. The session ID stays stable across the blocked/recovered cycle — no state is lost.
Post-Dispatch Drift
When repo state changes after a turn is dispatched — HEAD moves, runtime is rebound in agentxchain.json, or authority changes — the active turn becomes stale. agentxchain status and agentxchain doctor detect this drift and surface actionable recovery commands.
| Drift Type | Cause | Detection | Recovery |
|---|---|---|---|
| Baseline drift | HEAD changed after dispatch (e.g., unrelated commit) | status shows Stale binding warning | agentxchain reissue-turn --turn <id> --reason "baseline drift" |
| Runtime drift | Runtime rebound in agentxchain.json after turn was assigned | status and doctor show runtime mismatch | agentxchain reissue-turn --turn <id> --reason "runtime rebinding" |
| Authority drift | write_authority changed on the assigned role | status shows authority mismatch | agentxchain reissue-turn --turn <id> --reason "authority change" |
reissue-turn atomically:
- Invalidates the active turn and archives it to history.jsonl
- Captures a fresh baseline from current repo state
- Creates a new turn with the same role and phase under the updated binding
- Emits a turn_reissued event with old and new baseline details
- Writes a decision ledger entry for auditability
After reissue, run agentxchain step --resume to dispatch the fresh turn.
reject-turn also refreshes the baseline when retrying, so a simple reject-and-retry path works for baseline drift without explicit reissue. However, reissue-turn is the canonical recovery command because it covers all drift types and produces a cleaner audit trail.
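The three drift types amount to comparing what was recorded at dispatch with the current state. The sketch below assumes a simple snapshot of HEAD, runtime, and write_authority; the Binding record, the helper names, and the sample values are hypothetical, while the reason strings and the reissue-turn command shape are taken from the drift table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Binding:
    head: str              # git HEAD at the time of the snapshot
    runtime: str           # runtime bound to the role (placeholder values below)
    write_authority: str   # write_authority of the assigned role


def detect_drift(at_dispatch: Binding, current: Binding) -> List[str]:
    """Return a reissue reason for every kind of drift that is present."""
    reasons = []
    if at_dispatch.head != current.head:
        reasons.append("baseline drift")
    if at_dispatch.runtime != current.runtime:
        reasons.append("runtime rebinding")
    if at_dispatch.write_authority != current.write_authority:
        reasons.append("authority change")
    return reasons


def reissue_command(turn_id: str, reason: str) -> str:
    return f'agentxchain reissue-turn --turn {turn_id} --reason "{reason}"'


before = Binding(head="abc123", runtime="runtime-a", write_authority="level-1")
after = Binding(head="def456", runtime="runtime-a", write_authority="level-1")
for reason in detect_drift(before, after):
    print(reissue_command("turn_7", reason))
```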
See also: Runtime Matrix for valid runtime/authority combinations.
Operator Commits After Checkpoint
If a human intentionally commits on top of the last governed checkpoint, agentxchain status reports:
Drift: Git HEAD has moved since checkpoint
Action: agentxchain reconcile-state --accept-operator-head
Use that command only for safe fast-forward operator commits. It verifies that the checkpoint baseline is still an ancestor of HEAD, rejects commits that modify .agentxchain/, rejects deletion of critical governed evidence, updates the accepted integration ref to the current HEAD, refreshes session.json.baseline_ref, and emits state_reconciled_operator_commits.
Unsafe cases still block: history rewrites, commits that edit .agentxchain/state.json, and deletion of critical evidence require explicit manual recovery or restart from a known checkpoint.
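The safety checks can be pictured as a pure function over facts gathered from git. This is only a sketch: the function name, its inputs, and the refusal messages are assumptions, and the set of "critical evidence" paths is passed in because this page does not enumerate it. The ancestry fact itself could be obtained with git merge-base --is-ancestor <checkpoint> HEAD, which exits 0 when the checkpoint is an ancestor.

```python
from typing import Iterable, Optional


def operator_head_refusal(
    checkpoint_is_ancestor: bool,
    changed_paths: Iterable[str],
    deleted_paths: Iterable[str],
    critical_evidence: Iterable[str],
) -> Optional[str]:
    """Return a refusal reason, or None when the operator commits look safe."""
    if not checkpoint_is_ancestor:
        return "history rewrite: checkpoint is not an ancestor of HEAD"
    for path in changed_paths:
        if path == ".agentxchain" or path.startswith(".agentxchain/"):
            return f"commit modifies governed state: {path}"
    lost = set(deleted_paths) & set(critical_evidence)
    if lost:
        return "critical evidence deleted: " + ", ".join(sorted(lost))
    return None


print(operator_head_refusal(True, ["src/app.py"], [], ["evidence/report.md"]))
```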
Other States
| Condition | Recovery |
|---|---|
| Latest accepted authoritative turn is not checkpointed | agentxchain checkpoint-turn --turn <id> |
| Dirty authoritative tree unrelated to the latest accepted turn | git commit or git stash, then agentxchain step |
| Unknown block | Inspect .agentxchain/state.json, resolve manually, then agentxchain step |
Accepted-turn checkpointing
Accepted authoritative turns are valid governed state, but they still leave the repo dirty until that state is checkpointed into git. AgentXchain now has a first-class checkpoint model for that boundary:
- agentxchain checkpoint-turn --turn <id> commits exactly the accepted turn's declared files_changed
- agentxchain accept-turn --checkpoint accepts the turn and immediately attempts the checkpoint
- agentxchain run --continuous enables --auto-checkpoint by default so role handoffs do not stop on manual repo commits
If the next authoritative assignment refuses with:
Accepted turn <id> is not checkpointed yet. Run agentxchain checkpoint-turn --turn <id> before assigning the next code-writing turn.
that means the repo only contains the accepted turn's uncheckpointed files. Checkpoint that turn first. Do not paper over it with git add -A; that risks mixing unrelated dirt into the acceptance boundary.
If accept-turn --checkpoint fails after acceptance, the accepted turn is preserved and the recovery path is explicit:
agentxchain checkpoint-turn --turn <id>
Command Reference
The recovery surface uses these existing commands — no dedicated recover command is needed because every blocked state maps to a specific, targeted command:
| Command | Recovery Role |
|---|---|
| step --resume | Resume a blocked non-manual retained turn when you want dispatch plus waiting in one command |
| resume | Resume a blocked run: re-dispatches retained turns without waiting or assigns the next turn (budget, escalation, any blocked state) |
| unblock | Resolve the current human escalation record and continue the run through the correct resume path |
| resume --role | Re-assign a turn after cleared block or hook failure during assignment |
| approve-transition | Approve a phase gate |
| approve-completion | Approve run completion |
| reject-turn | Reject a failed turn and trigger retry or reassignment |
| reissue-turn | Invalidate a stale turn and reissue against current state (baseline, runtime, or authority drift) |
| reconcile-state --accept-operator-head | Accept safe fast-forward operator commits as the new governed baseline |
| accept-turn | Manually accept a turn after validation failure and manual fix |
| checkpoint-turn | Commit the latest accepted authoritative turn so the next writable role starts from a clean baseline |
| escalate | Raise an operator escalation on an active or blocked turn |
| mission plan launch --retry | Retry one failed repo-local turn inside a coordinator workstream |
| status | View current blocked state and recovery action |
Recovery Descriptor Contract
The deriveRecoveryDescriptor() function in blocked-state.js is the canonical map from governed state to recovery action. Every blocked state MUST be registered in this function. The typed reasons are:
- pending_run_completion
- pending_phase_transition
- needs_human
- dispatch_error
- operator_escalation
- retries_exhausted
- hook_block
- hook_tamper
- conflict_loop
- budget_exhausted
- policy_escalation
- timeout
- gate_action_failed
- unknown_block
The unknown_block fallback ensures that even unrecognized blocked_on values produce a recovery action (manual inspection).
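The registry-plus-fallback idea can be sketched as a lookup table with a default. This is not the code in blocked-state.js: the Python function, the dictionary, and the returned shape are hypothetical, and only a subset of the typed reasons is registered here, with commands copied from the recovery tables on this page.

```python
# Illustrative subset; the real map must register every blocked state.
RECOVERY_ACTIONS = {
    "pending_phase_transition": "agentxchain approve-transition",
    "pending_run_completion": "agentxchain approve-completion",
    "dispatch_error": "agentxchain step --resume",
    "budget_exhausted": "agentxchain resume",
}

UNKNOWN_ACTION = "Inspect .agentxchain/state.json, resolve manually, then agentxchain step"


def derive_recovery(typed_reason: str) -> dict:
    """Map a typed reason to a recovery action, never returning nothing."""
    action = RECOVERY_ACTIONS.get(typed_reason)
    if action is None:
        # Unrecognized values still produce an operator escape route.
        return {"typed_reason": "unknown_block", "recovery_action": UNKNOWN_ACTION}
    return {"typed_reason": typed_reason, "recovery_action": action}


print(derive_recovery("dispatch_error"))
print(derive_recovery("something_new"))
```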
Coordinator-Level Recovery
When repo-local governed runs are orchestrated through a multi-repo coordinator (via coordinator-bound mission plans), failures surface at both the repo level and the coordinator workstream level. The recovery model is layered:
Targeted coordinator retry
If a repo-local dispatch inside a coordinator workstream fails with a retryable state (failed or failed_acceptance), the operator can retry directly from the mission surface:
agentxchain mission plan launch latest --workstream <id> --retry
This reissues the failed repo-local turn from current HEAD, rewrites the dispatch bundle, appends retry metadata to the coordinator launch record, and executes the retried turn immediately. The original failed dispatch is preserved for audit with retried_at and retry_reason fields.
If the repo-local retry succeeds but the coordinator cannot append the matching acceptance_projection immediately, the command still exits as a retry success. In that case the JSON payload includes warnings[], warning code coordinator_acceptance_projection_incomplete, and reconciliation_required: true. That means the repo-local retry already succeeded, but the coordinator view still needs a sync before you treat downstream work as unblocked.
Inspect the durable warning with:
agentxchain events --type coordinator_retry_projection_warning
agentxchain mission plan show latest --json
The first command shows the persisted warning event after the terminal output is gone. The second command forces plan synchronization so you can confirm whether the accepted retry has now projected cleanly.
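A script that wraps the retry command might check the JSON payload for the warning contract before unblocking downstream work. The sketch below assumes each entry in warnings[] is an object with a code field; that shape is a guess, since this page only names the warnings[] array, the warning code, and reconciliation_required.

```python
import json

PROJECTION_WARNING = "coordinator_acceptance_projection_incomplete"


def needs_coordinator_sync(payload_json: str) -> bool:
    """True if the retry succeeded but the coordinator view still needs a sync."""
    payload = json.loads(payload_json)
    codes = {
        w.get("code")
        for w in payload.get("warnings", [])
        if isinstance(w, dict)           # assumed entry shape: {"code": "..."}
    }
    return bool(payload.get("reconciliation_required")) or PROJECTION_WARNING in codes


sample = json.dumps({
    "warnings": [{"code": PROJECTION_WARNING}],
    "reconciliation_required": True,
})
print(needs_coordinator_sync(sample))
```

When the check returns true, running agentxchain mission plan show latest --json forces the plan synchronization described above.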
Safety guards — retry fails closed when:
- The workstream is not in needs_attention
- Multiple repo failures exist in the same workstream (only single-failure retry is supported)
- The failed repo-local turn is no longer active
- The failure state is non-retryable (conflicted, retrying, needs_human)
- A dependent workstream has already dispatched since the failed repo turn
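The guards above can be read as an ordered series of checks that each fail closed. In this sketch the state names (needs_attention, failed, failed_acceptance) come from this page, while the function name, its parameters, and the refusal messages are hypothetical.

```python
from typing import List, Optional

RETRYABLE_STATES = {"failed", "failed_acceptance"}


def retry_refusal(
    workstream_status: str,
    failure_states: List[str],       # one entry per failed repo in the workstream
    failed_turn_active: bool,
    dependent_dispatched: bool,
) -> Optional[str]:
    """Return why a coordinator retry is refused, or None if it may proceed."""
    if workstream_status != "needs_attention":
        return "workstream is not in needs_attention"
    if len(failure_states) != 1:
        return "only single-failure retry is supported"
    if not failed_turn_active:
        return "failed repo-local turn is no longer active"
    if failure_states[0] not in RETRYABLE_STATES:
        return f"failure state is non-retryable: {failure_states[0]}"
    if dependent_dispatched:
        return "a dependent workstream has already dispatched"
    return None


print(retry_refusal("needs_attention", ["failed"], True, False))
```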
Coordinator autopilot auto-retry
If you want the coordinator to spend a bounded retry budget during unattended waves, use:
agentxchain mission plan autopilot latest --auto-retry --max-retries 1
This does not create a second retry engine. Autopilot reuses the same narrow coordinator retry path as mission plan launch --workstream <id> --retry:
- one retryable repo failure at a time
- same downstream-dispatch safety guard
- same coordinator_acceptance_projection_incomplete warning contract
- per-session retry budget only (--max-retries resets when you rerun autopilot)
If the retry budget is exhausted or the retry attempt fails again, autopilot falls back to the normal coordinator failure boundary (failure_stopped or plan_incomplete). It does not loop forever and it does not invent broader cross-repo rollback semantics.
Repo-local recovery fallback
When coordinator --retry cannot be used (non-retryable state, multi-failure, or dependent dispatch conflict), fall back to repo-local recovery in the affected child repo:
- Inspect the failure: agentxchain status / agentxchain doctor in the child repo
- Apply the appropriate repo-local fix: reissue-turn, reject-turn, or step --resume
- Return to the mission surface and resume with mission plan launch --workstream <id> or mission plan autopilot --continue-on-failure
Dashboard visibility
The dashboard GET /api/plans endpoint exposes repo_dispatches[] with retry metadata (is_retry, retry_of, retried_at, retry_reason) for coordinator launch records, so operators can see retry history without inspecting raw plan artifacts.
See Missions — Recovering a failed coordinator workstream for the full operator walkthrough.
Auditable Recovery
Every recovery action is recorded in the decision ledger (.agentxchain/decision-ledger.jsonl):
- Escalation raised: decision = "operator_escalated"
- Escalation resolved: decision = "escalation_resolved"
- Run reactivated: decision = "escalation_resolved" with via field
- Phase approved: decision = "phase_transition_approved"
- Policy escalation: decision = "policy_escalation" with violating policy details
- Timeout: type = "timeout" with scope, limit, and elapsed time
- Run completed: decision = "run_completion_approved"
This ensures that recovery actions are part of the governed audit trail, not silent state mutations.
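Because the ledger is a JSONL file, an audit script can tally recovery activity by reading it line by line. The sketch below assumes each line is a JSON object carrying either a decision or a type key, as in the entries listed above; any other fields, and the sample lines used here, are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_ledger(lines: Iterable[str]) -> Counter:
    """Count ledger entries by decision (or by type when decision is absent)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        entry = json.loads(line)
        key = entry.get("decision") or entry.get("type") or "unknown"
        counts[key] += 1
    return counts


sample_lines = [
    '{"decision": "operator_escalated"}',
    '{"decision": "escalation_resolved", "via": "resume"}',
    '{"type": "timeout"}',
    '',
    '{"decision": "escalation_resolved"}',
]
print(summarize_ledger(sample_lines))
```

Against a real project the same function can be fed an open file handle for .agentxchain/decision-ledger.jsonl.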