Recovery

Every governed run can encounter blocked or failed states. AgentXchain ensures that every such state has an explicit recovery path — no state is terminal without an operator escape route.

How Recovery Works

When a governed run enters a blocked state, the orchestrator persists a recovery descriptor containing:

typed_reason — what category of block occurred
owner — who is responsible for resolution (always human for operator-facing blocks)
recovery_action — the exact command or action to take
turn_retained — whether the blocked turn is still assigned and can be resumed
runtime_guidance — optional runtime-aware blocker explanations for gate failures caused by config/runtime ownership limits

The agentxchain status command always displays the current recovery action when a run is blocked.

When the blocker truly requires human action, AgentXchain also writes a structured escalation record to .agentxchain/human-escalations.jsonl and mirrors the current open item into HUMAN_TASKS.md. status surfaces the linked escalation id together with agentxchain unblock <id>.

When the latest gate failure exposes a deeper runtime/config blocker, status also renders runtime guidance. These codes are:

Code	Meaning	Operator action
`invalid_binding`	The owning role/runtime binding is structurally invalid for file ownership	Fix `agentxchain.json`, then `agentxchain validate`
`review_only_remote_dead_end`	A remote review-only role owns a required artifact but can only return review artifacts	Fix `agentxchain.json`, then `agentxchain validate`
`proposal_apply_required`	Required files exist only in the staged proposal, not the workspace yet	`agentxchain proposal apply <turn_id>`
`tool_defined_proof_not_strong_enough`	MCP/tool-defined ownership cannot be proven statically	`agentxchain role show <role_id>` and inspect the tool contract

Recovery Map

Approval Gates

State	Cause	Recovery
`pending_phase_transition`	Phase gate requires human approval	`agentxchain approve-transition`
`pending_run_completion`	Run completion gate requires human approval	`agentxchain approve-completion`

These are governance-mandated pauses, not failures. The run resumes automatically after approval unless a configured gate action fails. When that happens, the run blocks with typed_reason: gate_action_failed and the same approval command remains the recovery action. See Gate Actions.

Dispatch Failures

State	Cause	Recovery
`dispatch_error`	Adapter dispatch failed (API error, MCP transport failure, local CLI crash)	Fix the issue, then `agentxchain step --resume`

The failed turn is retained. step --resume re-dispatches the same turn to the same role without re-assignment.

Escalations

State	Cause	Recovery
`operator_escalation`	Operator raised via `agentxchain escalate`	Resolve the issue, then `agentxchain resume` for run-level or retained `manual` turns; use `agentxchain step --resume` for retained non-manual turns
`retries_exhausted`	Max retries hit on a role turn	`agentxchain resume` for retained `manual` turns, otherwise `agentxchain step --resume`

Operator escalations record decision = "operator_escalated" in the decision ledger. Resolution records decision = "escalation_resolved". If the escalation targeted one retained turn out of several, the surfaced recovery command includes --turn <id> so the operator can run it directly.

Agent Requests

State	Cause	Recovery
`needs_human`	Agent returned `blocked_on: "human:..."` requesting human intervention	Resolve the stated issue, then `agentxchain unblock <id>`

The normal needs_human path accepts the turn that raised the issue and clears it, so recovery assigns the next turn after unblock instead of trying to wait on a non-existent retained turn.

If the escalation is tied to a phase-exit gate, unblock <id> also reconciles the gate before dispatch. A satisfied standing gate can advance the phase even when there is no pending_phase_transition object; AgentXchain marks the gate passed, clears stale active turns and reservations from the exited phase, emits phase_cleanup, then dispatches the next phase's entry role.

When a human escalation is raised, three notification surfaces fire automatically:

events.jsonl — a human_escalation_raised event is appended with full escalation metadata (type, service, action, resolution command).
stderr — a structured local notice prints the escalation ID, type, action, and unblock command. This fires regardless of webhook configuration.
webhooks — if configured, a human_escalation_raised notification is delivered to all subscribed webhooks.

When the escalation is resolved via agentxchain unblock <id>, the corresponding human_escalation_resolved event is emitted to all three surfaces.

Set AGENTXCHAIN_LOCAL_NOTIFY=1 on macOS to also receive native desktop notifications.

Hook Failures

State	Cause	Recovery
`hook_block`	Lifecycle hook failed without tampering	Follow the surfaced command: `approve-*`, `accept-turn`, runtime-aware retained-turn recovery, or `resume --role` depending on where the hook fired
`hook_tamper`	Hook detected unauthorized file changes	Fix tampering, then follow the surfaced command (`agentxchain resume` for cleared or retained `manual` turns, `agentxchain step --resume` for retained non-manual turns, approval commands for gate pauses)

Hook blocks during assignment require resume --role to re-assign. Hook tamper no longer assumes every blocked turn is resumable with step --resume; the persisted recovery action is derived from the actual retained-turn state.

Turn Conflicts

When a turn's files_changed overlap with files accepted by another turn since the conflicting turn was assigned, the governed-state layer detects a conflict. The first two detections mark the turn as conflicted but the run stays active. On the third consecutive detection (detection count ≥ 3), the run transitions to blocked with typed_reason: 'conflict_loop'.

agentxchain status shows both resolution options, the number of conflicting files, the detection count, and the overlap percentage. The suggested resolution is based on file overlap:

Overlap < 50% → reject_and_reassign (re-dispatch with conflict context; faster automation recovery)
Overlap ≥ 50% → human_merge (too many overlapping files for clean re-dispatch; operator explicitly accepts the current staged result as the authoritative merge in one command)

The suggestion is guidance, not enforcement — operators may choose either path regardless of overlap.

State	Cause	Recovery
`conflict_loop`	3+ consecutive conflict detections on the same turn	Default surfaced action: `agentxchain reject-turn --turn <id> --reassign`; alternate single-step merge path: `agentxchain accept-turn --turn <id> --resolution human_merge`

Condition	Recovery
Conflicted turn (overlapping agent changes)	`agentxchain reject-turn --reassign` or `agentxchain accept-turn --resolution human_merge` (one-step acceptance of the staged merge result)
Validation failure (retryable)	`agentxchain reject-turn` then `agentxchain step`
Validation failure (non-retryable)	Manual fix, then `agentxchain accept-turn`

Policy Escalations

State	Cause	Recovery
`policy_escalation`	A declarative policy with `action: "escalate"` fired (e.g., turn cap reached, role monopoly detected)	Resolve the policy condition (e.g., increase the limit in `agentxchain.json`, change routing), then use `agentxchain resume` for retained `manual` turns or cleared runs; use `agentxchain step --resume` for retained non-manual turns

Policy escalations are distinct from operator escalations. They are triggered automatically by the declarative policy engine based on config rules, not by an operator command. The agentxchain status output shows which policy fired and what condition was violated. The decision-ledger.jsonl records decision = "policy_escalation" with the violating policy details.

Timeout

State	Cause	Recovery
`timeout`	A turn, phase, or run exceeded its configured time limit (`timeouts` in `agentxchain.json`)	Check whether the timeout is caused by a stuck agent or legitimately long work. If the limit was too tight, raise it with `agentxchain config --set timeouts.per_turn_minutes <min>` (or `per_phase_minutes` / `per_run_minutes`), then `agentxchain resume`

Timeouts are enforced when accepted work crosses the turn-acceptance boundary, and agentxchain status shows read-only timeout pressure or persisted timeout blocks. They do not currently re-fire on approve-transition or approve-completion. Per-phase timeouts can be overridden in routing with timeout_minutes and timeout_action per phase.

Approval waits are exempt from timeout mutation, but they do not stop the phase/run clock. If a run sits in pending_phase_transition or pending_run_completion for days, the next accepted turn can immediately block on timeout:phase or timeout:run. That is why status should warn during the approval pause instead of waiting for the next acceptance boundary.

For the full config and report/status surface, see Timeouts.

Budget Exhaustion

State	Cause	Recovery
`budget_exhausted`	Cumulative turn costs exceeded `per_run_max_usd`	Increase budget with `agentxchain config --set budget.per_run_max_usd <usd>`, then `agentxchain resume`

The turn that exhausts the budget IS accepted — its work is preserved. Only subsequent turns are blocked until the operator raises the budget limit. Use agentxchain config --set budget.per_run_max_usd <usd> for that simple recovery instead of hand-editing JSON. Because there is no retained turn, budget recovery re-assigns the next turn with agentxchain resume instead of replaying work with step --resume.

To prevent hard stops on budget exhaustion, switch to warn mode with agentxchain config --set budget.on_exceed warn. In warn mode the run continues past budget with observable warnings in status, events, and notifications instead of blocking.

Session-level budget (continuous mode): In addition to per-run budget caps, continuous mode supports a cumulative session-level budget via --session-budget <usd> or continuous.per_session_max_usd in config. When total spend across all runs in the session reaches this limit, the continuous loop exits cleanly with session_budget_exhausted. This is a terminal stop, not a blocker — start a new session to continue. Use agentxchain status to see cumulative spend and budget utilization for the active session.

Continuous failure handling: If a governed run reaches blocked state during continuous mode, the session pauses and preserves the blocked recovery surface (run_blocked_reason, run_blocked_recovery) instead of pretending the run completed. Operators must follow that exact surfaced recovery action, not assume every blocked session resolves with agentxchain unblock <id>. Human escalations still use unblock, but retained ghost/stale turns use the surfaced agentxchain reissue-turn --reason ghost|stale path. If a governed run fails without reaching blocked state, the continuous session stops with status: "failed" and leaves the current intake intent unresolved for operator investigation.

Continuous ghost auto-recovery: When run_loop.continuous.auto_retry_on_ghost.enabled is true, continuous mode automatically reissues retained startup ghosts (failed_start with runtime_spawn_failed or stdout_attach_failed) through the same governed reissueTurn() path. Attempts are capped by max_retries_per_run, recorded in continuous-session.json, and emitted as auto_retried_ghost events. When the cap is exhausted, the session pauses, emits ghost_retry_exhausted, and mirrors Auto-retry exhausted after N/N attempts into blocked_reason.recovery.detail while keeping the manual reissue-turn --reason ghost command visible. Non-continuous runs and opt-out continuous sessions keep the manual recovery posture.

Paused-session guard: When a continuous session is paused (awaiting the surfaced recovery action), the daemon keeps polling but does not attempt to start new work. On each poll cycle, advanceContinuousRunOnce() checks the governed project state: if still blocked, it returns still_blocked and stays paused; if the block has been resolved (via the stored recovery_action, such as agentxchain unblock <id> for needs_human or agentxchain reissue-turn --reason ghost|stale for retained ghost/stale turns), the session resumes automatically by continuing the existing governed run. The session ID stays stable across the blocked/recovered cycle — no state is lost.

Post-Dispatch Drift

When repo state changes after a turn is dispatched — HEAD moves, runtime is rebound in agentxchain.json, or authority changes — the active turn becomes stale. agentxchain status and agentxchain doctor detect this drift and surface actionable recovery commands.

Drift Type	Cause	Detection	Recovery
Baseline drift	`HEAD` changed after dispatch (e.g., unrelated commit)	`status` shows `Stale binding` warning	`agentxchain reissue-turn --turn <id> --reason "baseline drift"`
Runtime drift	Runtime rebound in `agentxchain.json` after turn was assigned	`status` and `doctor` show runtime mismatch	`agentxchain reissue-turn --turn <id> --reason "runtime rebinding"`
Authority drift	`write_authority` changed on the assigned role	`status` shows authority mismatch	`agentxchain reissue-turn --turn <id> --reason "authority change"`

reissue-turn atomically:

Invalidates the active turn and archives it to history.jsonl
Captures a fresh baseline from current repo state
Creates a new turn with the same role and phase under the updated binding
Emits a turn_reissued event with old and new baseline details
Writes a decision ledger entry for auditability

After reissue, run agentxchain step --resume to dispatch the fresh turn.

reject-turn also refreshes the baseline when retrying, so a simple reject-and-retry path works for baseline drift without explicit reissue. However, reissue-turn is the canonical recovery command because it covers all drift types and produces a cleaner audit trail.

See also: Runtime Matrix for valid runtime/authority combinations.

Operator Commits After Checkpoint

If a human intentionally commits on top of the last governed checkpoint, agentxchain status reports:

Drift: Git HEAD has moved since checkpoint
Action: agentxchain reconcile-state --accept-operator-head

Use that command only for safe fast-forward operator commits. It verifies that the checkpoint baseline is still an ancestor of HEAD, rejects commits that modify .agentxchain/, rejects deletion of critical governed evidence, updates the accepted integration ref to the current HEAD, refreshes session.json.baseline_ref, and emits state_reconciled_operator_commits.

Unsafe cases still block: history rewrites, commits that edit .agentxchain/state.json, and deletion of critical evidence require explicit manual recovery or restart from a known checkpoint.

Other States

Condition	Recovery
Latest accepted authoritative turn is not checkpointed	`agentxchain checkpoint-turn --turn <id>`
Dirty authoritative tree unrelated to the latest accepted turn	`git commit` or `git stash`, then `agentxchain step`
Unknown block	Inspect `.agentxchain/state.json`, resolve manually, then `agentxchain step`

Accepted-turn checkpointing

Accepted authoritative turns are valid governed state, but they still leave the repo dirty until that state is checkpointed into git. AgentXchain now has a first-class checkpoint model for that boundary:

agentxchain checkpoint-turn --turn <id> commits exactly the accepted turn's declared files_changed
agentxchain accept-turn --checkpoint accepts the turn and immediately attempts the checkpoint
agentxchain run --continuous enables --auto-checkpoint by default so role handoffs do not stop on manual repo commits

If the next authoritative assignment refuses with:

Accepted turn <id> is not checkpointed yet. Run agentxchain checkpoint-turn --turn <id> before assigning the next code-writing turn.

that means the repo only contains the accepted turn's uncheckpointed files. Checkpoint that turn first. Do not paper over it with git add -A; that risks mixing unrelated dirt into the acceptance boundary.

If accept-turn --checkpoint fails after acceptance, the accepted turn is preserved and the recovery path is explicit:

agentxchain checkpoint-turn --turn <id>

Command Reference

The recovery surface uses these existing commands — no dedicated recover command is needed because every blocked state maps to a specific, targeted command:

Command	Recovery Role
`step --resume`	Resume a blocked non-manual retained turn when you want dispatch plus waiting in one command
`resume`	Resume a blocked run: re-dispatches retained turns without waiting or assigns the next turn (budget, escalation, any blocked state)
`unblock`	Resolve the current human escalation record and continue the run through the correct `resume` path
`resume --role`	Re-assign a turn after cleared block or hook failure during assignment
`approve-transition`	Approve a phase gate
`approve-completion`	Approve run completion
`reject-turn`	Reject a failed turn and trigger retry or reassignment
`reissue-turn`	Invalidate a stale turn and reissue against current state (baseline, runtime, or authority drift)
`reconcile-state --accept-operator-head`	Accept safe fast-forward operator commits as the new governed baseline
`accept-turn`	Manually accept a turn after validation failure and manual fix
`checkpoint-turn`	Commit the latest accepted authoritative turn so the next writable role starts from a clean baseline
`escalate`	Raise an operator escalation on an active or blocked turn
`mission plan launch --retry`	Retry one failed repo-local turn inside a coordinator workstream
`status`	View current blocked state and recovery action

Recovery Descriptor Contract

The deriveRecoveryDescriptor() function in blocked-state.js is the canonical map from governed state to recovery action. Every blocked state MUST be registered in this function. The typed reasons are:

pending_run_completion
pending_phase_transition
needs_human
dispatch_error
operator_escalation
retries_exhausted
hook_block
hook_tamper
conflict_loop
budget_exhausted
policy_escalation
timeout
gate_action_failed
unknown_block

The unknown_block fallback ensures that even unrecognized blocked_on values produce a recovery action (manual inspection).

Coordinator-Level Recovery

When repo-local governed runs are orchestrated through a multi-repo coordinator (via coordinator-bound mission plans), failures surface at both the repo level and the coordinator workstream level. The recovery model is layered:

Targeted coordinator retry

If a repo-local dispatch inside a coordinator workstream fails with a retryable state (failed or failed_acceptance), the operator can retry directly from the mission surface:

agentxchain mission plan launch latest --workstream <id> --retry

This reissues the failed repo-local turn from current HEAD, rewrites the dispatch bundle, appends retry metadata to the coordinator launch record, and executes the retried turn immediately. The original failed dispatch is preserved for audit with retried_at and retry_reason fields.

If the repo-local retry succeeds but the coordinator cannot append the matching acceptance_projection immediately, the command still exits as a retry success. In that case the JSON payload includes warnings[], warning code coordinator_acceptance_projection_incomplete, and reconciliation_required: true. That means the repo-local retry already succeeded, but the coordinator view still needs a sync before you treat downstream work as unblocked.

Inspect the durable warning with:

agentxchain events --type coordinator_retry_projection_warning
agentxchain mission plan show latest --json

The first command shows the persisted warning event after the terminal output is gone. The second command forces plan synchronization so you can confirm whether the accepted retry has now projected cleanly.

Safety guards — retry fails closed when:

The workstream is not in needs_attention
Multiple repo failures exist in the same workstream (only single-failure retry is supported)
The failed repo-local turn is no longer active
The failure state is non-retryable (conflicted, retrying, needs_human)
A dependent workstream has already dispatched since the failed repo turn

Coordinator autopilot auto-retry

If you want the coordinator to spend a bounded retry budget during unattended waves, use:

agentxchain mission plan autopilot latest --auto-retry --max-retries 1

This does not create a second retry engine. Autopilot reuses the same narrow coordinator retry path as mission plan launch --workstream <id> --retry:

one retryable repo failure at a time
same downstream-dispatch safety guard
same coordinator_acceptance_projection_incomplete warning contract
per-session retry budget only (--max-retries resets when you rerun autopilot)

If the retry budget is exhausted or the retry attempt fails again, autopilot falls back to the normal coordinator failure boundary (failure_stopped or plan_incomplete). It does not loop forever and it does not invent broader cross-repo rollback semantics.

Repo-local recovery fallback

When coordinator --retry cannot be used (non-retryable state, multi-failure, or dependent dispatch conflict), fall back to repo-local recovery in the affected child repo:

Inspect the failure: agentxchain status / agentxchain doctor in the child repo
Apply the appropriate repo-local fix: reissue-turn, reject-turn, or step --resume
Return to the mission surface and resume with mission plan launch --workstream <id> or mission plan autopilot --continue-on-failure

Dashboard visibility

The dashboard GET /api/plans endpoint exposes repo_dispatches[] with retry metadata (is_retry, retry_of, retried_at, retry_reason) for coordinator launch records, so operators can see retry history without inspecting raw plan artifacts.

See Missions — Recovering a failed coordinator workstream for the full operator walkthrough.

Auditable Recovery

Every recovery action is recorded in the decision ledger (.agentxchain/decision-ledger.jsonl):

Escalation raised: decision = "operator_escalated"
Escalation resolved: decision = "escalation_resolved"
Run reactivated: decision = "escalation_resolved" with via field
Phase approved: decision = "phase_transition_approved"
Policy escalation: decision = "policy_escalation" with violating policy details
Timeout: type = "timeout" with scope, limit, and elapsed time
Run completed: decision = "run_completion_approved"

This ensures that recovery actions are part of the governed audit trail, not silent state mutations.

How Recovery Works​

Recovery Map​

Approval Gates​

Dispatch Failures​

Escalations​

Agent Requests​

Hook Failures​

Turn Conflicts​

Policy Escalations​

Timeout​

Budget Exhaustion​

Post-Dispatch Drift​

Operator Commits After Checkpoint​

Other States​

Accepted-turn checkpointing​

Command Reference​

Recovery Descriptor Contract​

Coordinator-Level Recovery​

Targeted coordinator retry​

Coordinator autopilot auto-retry​

Repo-local recovery fallback​

Dashboard visibility​

Auditable Recovery​