
Recovery

Every governed run can encounter blocked or failed states. AgentXchain ensures that every such state has an explicit recovery path — no state is terminal without an operator escape route.

How Recovery Works

When a governed run enters a blocked state, the orchestrator persists a recovery descriptor containing:

  • typed_reason — what category of block occurred
  • owner — who is responsible for resolution (always human for operator-facing blocks)
  • recovery_action — the exact command or action to take
  • turn_retained — whether the blocked turn is still assigned and can be resumed
  • runtime_guidance — optional runtime-aware blocker explanations for gate failures caused by config/runtime ownership limits
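
As a rough illustration, the descriptor fields listed above could be modelled as follows. The field names come from the list; the `RecoveryDescriptor` class name, the Python types, and the example values are assumptions made for this sketch, not the orchestrator's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RecoveryDescriptor:
    """Hypothetical in-memory shape of a persisted recovery descriptor."""
    typed_reason: str        # category of block, e.g. "dispatch_error"
    owner: str               # "human" for operator-facing blocks
    recovery_action: str     # exact command or action to take
    turn_retained: bool      # True if the blocked turn can still be resumed
    runtime_guidance: Optional[List[str]] = None  # optional; element type assumed


# Example values taken from the dispatch_error row of the recovery map.
descriptor = RecoveryDescriptor(
    typed_reason="dispatch_error",
    owner="human",
    recovery_action="agentxchain step --resume",
    turn_retained=True,
)
print(json.dumps(asdict(descriptor), indent=2))
```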

The agentxchain status command always displays the current recovery action when a run is blocked.

When the blocker truly requires human action, AgentXchain also writes a structured escalation record to .agentxchain/human-escalations.jsonl and mirrors the current open item into HUMAN_TASKS.md. The status command surfaces the linked escalation id together with the matching agentxchain unblock <id> command.

When the latest gate failure exposes a deeper runtime/config blocker, status also renders runtime guidance. These codes are:

| Code | Meaning | Operator action |
| --- | --- | --- |
| invalid_binding | The owning role/runtime binding is structurally invalid for file ownership | Fix agentxchain.json, then agentxchain validate |
| review_only_remote_dead_end | A remote review-only role owns a required artifact but can only return review artifacts | Fix agentxchain.json, then agentxchain validate |
| proposal_apply_required | Required files exist only in the staged proposal, not the workspace yet | agentxchain proposal apply <turn_id> |
| tool_defined_proof_not_strong_enough | MCP/tool-defined ownership cannot be proven statically | agentxchain role show <role_id> and inspect the tool contract |

Recovery Map

Approval Gates

| State | Cause | Recovery |
| --- | --- | --- |
| pending_phase_transition | Phase gate requires human approval | agentxchain approve-transition |
| pending_run_completion | Run completion gate requires human approval | agentxchain approve-completion |

These are governance-mandated pauses, not failures. The run resumes automatically after approval unless a configured gate action fails. When that happens, the run blocks with typed_reason: gate_action_failed and the same approval command remains the recovery action. See Gate Actions.

Dispatch Failures

| State | Cause | Recovery |
| --- | --- | --- |
| dispatch_error | Adapter dispatch failed (API error, MCP transport failure, local CLI crash) | Fix the issue, then agentxchain step --resume |

The failed turn is retained. step --resume re-dispatches the same turn to the same role without re-assignment.

Escalations

| State | Cause | Recovery |
| --- | --- | --- |
| operator_escalation | Operator raised via agentxchain escalate | Resolve the issue, then agentxchain resume for run-level or retained manual turns; use agentxchain step --resume for retained non-manual turns |
| retries_exhausted | Max retries hit on a role turn | agentxchain resume for retained manual turns, otherwise agentxchain step --resume |

Operator escalations record decision = "operator_escalated" in the decision ledger. Resolution records decision = "escalation_resolved". If the escalation targeted one retained turn out of several, the surfaced recovery command includes --turn <id> so the operator can run it directly.
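
The choice between resume and step --resume described above can be sketched as a small decision function. This is an illustration only: the function name and its parameters are hypothetical, and the real surfaced command is derived by AgentXchain from persisted state.

```python
from typing import Optional


def escalation_recovery_command(
    turn_retained: bool,
    manual: bool,
    turn_id: Optional[str] = None,
    retained_count: int = 1,
) -> str:
    """Pick the recovery command for an escalation (illustrative sketch)."""
    if turn_retained and not manual:
        # Retained non-manual turns are re-dispatched and waited on.
        command = "agentxchain step --resume"
    else:
        # Run-level escalations and retained manual turns use plain resume.
        command = "agentxchain resume"
    # When one of several retained turns was targeted, name it explicitly.
    if turn_retained and retained_count > 1 and turn_id:
        command += f" --turn {turn_id}"
    return command


print(escalation_recovery_command(turn_retained=True, manual=False))
print(escalation_recovery_command(True, False, turn_id="turn_7", retained_count=3))
```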

Agent Requests

| State | Cause | Recovery |
| --- | --- | --- |
| needs_human | Agent returned blocked_on: "human:..." requesting human intervention | Resolve the stated issue, then agentxchain unblock <id> |

The normal needs_human path accepts the turn that raised the issue and clears it, so recovery assigns the next turn after unblock instead of trying to wait on a non-existent retained turn.

If the escalation is tied to a phase-exit gate, unblock <id> also reconciles the gate before dispatch. A satisfied standing gate can advance the phase even when there is no pending_phase_transition object; AgentXchain marks the gate passed, clears stale active turns and reservations from the exited phase, emits phase_cleanup, then dispatches the next phase's entry role.

When a human escalation is raised, three notification surfaces fire automatically:

  1. events.jsonl — a human_escalation_raised event is appended with full escalation metadata (type, service, action, resolution command).
  2. stderr — a structured local notice prints the escalation ID, type, action, and unblock command. This fires regardless of webhook configuration.
  3. webhooks — if configured, a human_escalation_raised notification is delivered to all subscribed webhooks.

When the escalation is resolved via agentxchain unblock <id>, the corresponding human_escalation_resolved event is emitted to all three surfaces.
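
Because raised and resolved events both land in events.jsonl, an operator script could pair them up to list escalations that are still open. The sketch below assumes each JSONL record carries a `type` field and an `escalation_id` field; neither field name is confirmed by this page, so adjust them to the real event schema.

```python
import json
from typing import Iterable, List


def open_escalation_ids(event_lines: Iterable[str]) -> List[str]:
    """Return ids that were raised but not yet resolved (field names assumed)."""
    open_ids: List[str] = []
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        kind = event.get("type")
        esc_id = event.get("escalation_id")
        if kind == "human_escalation_raised":
            open_ids.append(esc_id)
        elif kind == "human_escalation_resolved" and esc_id in open_ids:
            open_ids.remove(esc_id)
    return open_ids


# Hypothetical sample records; real events carry more metadata.
sample = [
    '{"type": "human_escalation_raised", "escalation_id": "esc-1"}',
    '{"type": "human_escalation_raised", "escalation_id": "esc-2"}',
    '{"type": "human_escalation_resolved", "escalation_id": "esc-1"}',
]
print(open_escalation_ids(sample))
```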

Set AGENTXCHAIN_LOCAL_NOTIFY=1 on macOS to also receive native desktop notifications.

Hook Failures

| State | Cause | Recovery |
| --- | --- | --- |
| hook_block | Lifecycle hook failed without tampering | Follow the surfaced command: approve-*, accept-turn, runtime-aware retained-turn recovery, or resume --role depending on where the hook fired |
| hook_tamper | Hook detected unauthorized file changes | Fix tampering, then follow the surfaced command (agentxchain resume for cleared or retained manual turns, agentxchain step --resume for retained non-manual turns, approval commands for gate pauses) |

Hook blocks during assignment require resume --role to re-assign. Hook tamper no longer assumes every blocked turn is resumable with step --resume; the persisted recovery action is derived from the actual retained-turn state.

Turn Conflicts

When a turn's files_changed overlap with files accepted by another turn since the conflicting turn was assigned, the governed-state layer detects a conflict. The first two detections mark the turn as conflicted but the run stays active. On the third consecutive detection (detection count ≥ 3), the run transitions to blocked with typed_reason: 'conflict_loop'.

agentxchain status shows both resolution options, the number of conflicting files, the detection count, and the overlap percentage. The suggested resolution is based on file overlap:

  • Overlap < 50%: reject_and_reassign (re-dispatch with conflict context; faster automation recovery)
  • Overlap ≥ 50%: human_merge (too many overlapping files for clean re-dispatch; operator explicitly accepts the current staged result as the authoritative merge in one command)

The suggestion is guidance, not enforcement — operators may choose either path regardless of overlap.
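
The two thresholds in this section (50% overlap for the suggestion, three consecutive detections for the block) can be expressed as a minimal sketch. The function names and the returned status strings are illustrative, and the overlap percentage is taken as an input because this page does not define how it is computed.

```python
CONFLICT_LOOP_THRESHOLD = 3  # the third consecutive detection blocks the run


def suggested_resolution(overlap_percent: float) -> str:
    """Map file overlap to the suggested (not enforced) resolution."""
    return "reject_and_reassign" if overlap_percent < 50 else "human_merge"


def run_status_after_detection(detection_count: int) -> str:
    """Describe the run state after the Nth consecutive conflict detection."""
    if detection_count >= CONFLICT_LOOP_THRESHOLD:
        return "blocked (typed_reason: conflict_loop)"
    return "active (turn marked conflicted)"


for count, overlap in [(1, 20.0), (2, 49.9), (3, 50.0)]:
    print(count, run_status_after_detection(count), suggested_resolution(overlap))
```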

| State | Cause | Recovery |
| --- | --- | --- |
| conflict_loop | 3+ consecutive conflict detections on the same turn | Default surfaced action: agentxchain reject-turn --turn <id> --reassign; alternate single-step merge path: agentxchain accept-turn --turn <id> --resolution human_merge |

| Condition | Recovery |
| --- | --- |
| Conflicted turn (overlapping agent changes) | agentxchain reject-turn --reassign or agentxchain accept-turn --resolution human_merge (one-step acceptance of the staged merge result) |
| Validation failure (retryable) | agentxchain reject-turn then agentxchain step |
| Validation failure (non-retryable) | Manual fix, then agentxchain accept-turn |

Policy Escalations

| State | Cause | Recovery |
| --- | --- | --- |
| policy_escalation | A declarative policy with action: "escalate" fired (e.g., turn cap reached, role monopoly detected) | Resolve the policy condition (e.g., increase the limit in agentxchain.json, change routing), then use agentxchain resume for retained manual turns or cleared runs; use agentxchain step --resume for retained non-manual turns |

Policy escalations are distinct from operator escalations. They are triggered automatically by the declarative policy engine based on config rules, not by an operator command. The agentxchain status output shows which policy fired and what condition was violated. The decision-ledger.jsonl records decision = "policy_escalation" with the violating policy details.

Timeout

| State | Cause | Recovery |
| --- | --- | --- |
| timeout | A turn, phase, or run exceeded its configured time limit (timeouts in agentxchain.json) | Check whether the timeout is caused by a stuck agent or legitimately long work. If the limit was too tight, raise it with agentxchain config --set timeouts.per_turn_minutes <min> (or per_phase_minutes / per_run_minutes), then agentxchain resume |

Timeouts are enforced when accepted work crosses the turn-acceptance boundary, and agentxchain status shows read-only timeout pressure or persisted timeout blocks. They do not currently re-fire on approve-transition or approve-completion. Per-phase timeouts can be overridden in routing with timeout_minutes and timeout_action per phase.

Approval waits are exempt from timeout mutation, but they do not stop the phase/run clock. If a run sits in pending_phase_transition or pending_run_completion for days, the next accepted turn can immediately block on timeout:phase or timeout:run. That is why status should warn during the approval pause instead of waiting for the next acceptance boundary.
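
The interaction between approval waits and the phase clock can be illustrated with a small calculation. This sketch assumes the phase clock is a simple wall-clock difference evaluated at the next acceptance boundary; the function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def timeout_on_acceptance(
    phase_started_at: datetime,
    accepted_at: datetime,
    per_phase_minutes: int,
) -> Optional[str]:
    """Return the timeout block raised at acceptance, if any (sketch)."""
    elapsed = accepted_at - phase_started_at  # approval waits are NOT subtracted
    if elapsed > timedelta(minutes=per_phase_minutes):
        return "timeout:phase"
    return None


phase_start = datetime(2025, 1, 1, 9, 0)
approval_requested = phase_start + timedelta(hours=2)
approved = approval_requested + timedelta(days=3)   # run sat in an approval wait
next_acceptance = approved + timedelta(minutes=10)  # first accepted turn afterwards

# With a 24-hour phase limit, the first acceptance after approval blocks.
print(timeout_on_acceptance(phase_start, next_acceptance, per_phase_minutes=24 * 60))
```

No timeout fires during the three-day wait itself in this model; the block only appears when the next turn is accepted, which is the situation the status warning is meant to expose early.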

For the full config and report/status surface, see Timeouts.

Budget Exhaustion

| State | Cause | Recovery |
| --- | --- | --- |
| budget_exhausted | Cumulative turn costs exceeded per_run_max_usd | Increase budget with agentxchain config --set budget.per_run_max_usd <usd>, then agentxchain resume |

The turn that exhausts the budget IS accepted — its work is preserved. Only subsequent turns are blocked until the operator raises the budget limit. Use agentxchain config --set budget.per_run_max_usd <usd> for that simple recovery instead of hand-editing JSON. Because there is no retained turn, budget recovery re-assigns the next turn with agentxchain resume instead of replaying work with step --resume.

To prevent hard stops on budget exhaustion, switch to warn mode with agentxchain config --set budget.on_exceed warn. In warn mode the run continues past budget with observable warnings in status, events, and notifications instead of blocking.
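
The budget semantics above (the exhausting turn is accepted, later turns are blocked, warn mode keeps going) can be sketched as follows. The function name, the `warn_mode` flag, and the returned state labels are illustrative stand-ins, not AgentXchain's internal representation.

```python
from typing import Tuple


def record_turn_cost(
    spent_usd: float,
    turn_cost_usd: float,
    per_run_max_usd: float,
    warn_mode: bool = False,
) -> Tuple[float, str]:
    """Accept a turn's cost, then decide whether further turns may run."""
    total = spent_usd + turn_cost_usd  # the turn itself is always accepted
    if total <= per_run_max_usd:
        return total, "active"
    if warn_mode:
        # budget.on_exceed = warn: keep running, surface warnings instead.
        return total, "active_with_budget_warning"
    return total, "blocked:budget_exhausted"


print(record_turn_cost(4.0, 2.0, per_run_max_usd=5.0))
print(record_turn_cost(4.0, 2.0, per_run_max_usd=5.0, warn_mode=True))
```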

Session-level budget (continuous mode): In addition to per-run budget caps, continuous mode supports a cumulative session-level budget via --session-budget <usd> or continuous.per_session_max_usd in config. When total spend across all runs in the session reaches this limit, the continuous loop exits cleanly with session_budget_exhausted. This is a terminal stop, not a blocker — start a new session to continue. Use agentxchain status to see cumulative spend and budget utilization for the active session.

Continuous failure handling: If a governed run reaches blocked state during continuous mode, the session pauses and preserves the blocked recovery surface (run_blocked_reason, run_blocked_recovery) instead of pretending the run completed. Operators must follow that exact surfaced recovery action, not assume every blocked session resolves with agentxchain unblock <id>. Human escalations still use unblock, but retained ghost/stale turns use the surfaced agentxchain reissue-turn --reason ghost|stale path. If a governed run fails without reaching blocked state, the continuous session stops with status: "failed" and leaves the current intake intent unresolved for operator investigation.

Continuous ghost auto-recovery: When run_loop.continuous.auto_retry_on_ghost.enabled is true, continuous mode automatically reissues retained startup ghosts (failed_start with runtime_spawn_failed or stdout_attach_failed) through the same governed reissueTurn() path. Attempts are capped by max_retries_per_run, recorded in continuous-session.json, and emitted as auto_retried_ghost events. When the cap is exhausted, the session pauses, emits ghost_retry_exhausted, and mirrors Auto-retry exhausted after N/N attempts into blocked_reason.recovery.detail while keeping the manual reissue-turn --reason ghost command visible. Non-continuous runs and opt-out continuous sessions keep the manual recovery posture.
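
The capped retry behaviour can be sketched like this. The event names and the detail string follow the paragraph above; the function name and the tuple it returns are assumptions made for illustration.

```python
from typing import Optional, Tuple


def next_ghost_step(
    attempts: int, max_retries_per_run: int
) -> Tuple[int, str, Optional[str]]:
    """Return (attempts, event, recovery detail) for a retained startup ghost."""
    if attempts < max_retries_per_run:
        # Budget remains: reissue the turn and count the attempt.
        return attempts + 1, "auto_retried_ghost", None
    # Cap reached: pause and leave manual reissue-turn as the recovery path.
    detail = f"Auto-retry exhausted after {attempts}/{max_retries_per_run} attempts"
    return attempts, "ghost_retry_exhausted", detail


attempts = 0
for _ in range(3):
    attempts, event, detail = next_ghost_step(attempts, max_retries_per_run=2)
    print(attempts, event, detail)
```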

Paused-session guard: When a continuous session is paused (awaiting the surfaced recovery action), the daemon keeps polling but does not attempt to start new work. On each poll cycle, advanceContinuousRunOnce() checks the governed project state: if still blocked, it returns still_blocked and stays paused; if the block has been resolved (via the stored recovery_action, such as agentxchain unblock <id> for needs_human or agentxchain reissue-turn --reason ghost|stale for retained ghost/stale turns), the session resumes automatically by continuing the existing governed run. The session ID stays stable across the blocked/recovered cycle — no state is lost.
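
A minimal model of the paused-session guard is shown below. Only the still_blocked result is named on this page; the other return values, the dictionary shape, and the function name are hypothetical.

```python
def advance_paused_session(session: dict, project_blocked: bool) -> str:
    """One poll cycle for a continuous session (illustrative sketch)."""
    if session["status"] != "paused":
        return "not_paused"
    if project_blocked:
        # The recovery action has not been taken yet: stay paused, start nothing.
        return "still_blocked"
    # Block resolved: continue the existing governed run under the same session.
    session["status"] = "running"
    return "resumed"


session = {"session_id": "sess-1", "status": "paused"}
print(advance_paused_session(session, project_blocked=True))
print(advance_paused_session(session, project_blocked=False))
print(session["session_id"])  # the id does not change across the cycle
```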

Post-Dispatch Drift

When repo state changes after a turn is dispatched — HEAD moves, runtime is rebound in agentxchain.json, or authority changes — the active turn becomes stale. agentxchain status and agentxchain doctor detect this drift and surface actionable recovery commands.

| Drift Type | Cause | Detection | Recovery |
| --- | --- | --- | --- |
| Baseline drift | HEAD changed after dispatch (e.g., unrelated commit) | status shows Stale binding warning | agentxchain reissue-turn --turn <id> --reason "baseline drift" |
| Runtime drift | Runtime rebound in agentxchain.json after turn was assigned | status and doctor show runtime mismatch | agentxchain reissue-turn --turn <id> --reason "runtime rebinding" |
| Authority drift | write_authority changed on the assigned role | status shows authority mismatch | agentxchain reissue-turn --turn <id> --reason "authority change" |

reissue-turn atomically:

  1. Invalidates the active turn and archives it to history.jsonl
  2. Captures a fresh baseline from current repo state
  3. Creates a new turn with the same role and phase under the updated binding
  4. Emits a turn_reissued event with old and new baseline details
  5. Writes a decision ledger entry for auditability

After reissue, run agentxchain step --resume to dispatch the fresh turn.

reject-turn also refreshes the baseline when retrying, so a simple reject-and-retry path works for baseline drift without explicit reissue. However, reissue-turn is the canonical recovery command because it covers all drift types and produces a cleaner audit trail.

See also: Runtime Matrix for valid runtime/authority combinations.

Operator Commits After Checkpoint

If a human intentionally commits on top of the last governed checkpoint, agentxchain status reports:

Drift: Git HEAD has moved since checkpoint
Action: agentxchain reconcile-state --accept-operator-head

Use that command only for safe fast-forward operator commits. It verifies that the checkpoint baseline is still an ancestor of HEAD, rejects commits that modify .agentxchain/, rejects deletion of critical governed evidence, updates the accepted integration ref to the current HEAD, refreshes session.json.baseline_ref, and emits state_reconciled_operator_commits.

Unsafe cases still block: history rewrites, commits that edit .agentxchain/state.json, and deletion of critical evidence require explicit manual recovery or restart from a known checkpoint.
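
Two of the safety checks described above can be approximated outside AgentXchain with plain git. The sketch below is not the reconcile-state implementation: the function names are hypothetical, and it relies only on the standard `git merge-base --is-ancestor` exit codes (0 for ancestor, 1 for not an ancestor).

```python
import subprocess
from typing import Iterable


def baseline_is_ancestor(baseline_ref: str, repo_dir: str = ".") -> bool:
    """True if baseline_ref is an ancestor of HEAD (a fast-forward history)."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", baseline_ref, "HEAD"],
        cwd=repo_dir,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False  # history was rewritten or diverged
    raise RuntimeError(f"git merge-base failed with exit code {result.returncode}")


def touches_governed_state(changed_paths: Iterable[str]) -> bool:
    """True if any changed path lives under .agentxchain/."""
    return any(
        path == ".agentxchain" or path.startswith(".agentxchain/")
        for path in changed_paths
    )


print(touches_governed_state(["src/app.py", ".agentxchain/state.json"]))
```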

Other States

| Condition | Recovery |
| --- | --- |
| Latest accepted authoritative turn is not checkpointed | agentxchain checkpoint-turn --turn <id> |
| Dirty authoritative tree unrelated to the latest accepted turn | git commit or git stash, then agentxchain step |
| Unknown block | Inspect .agentxchain/state.json, resolve manually, then agentxchain step |

Accepted-turn checkpointing

Accepted authoritative turns are valid governed state, but they still leave the repo dirty until that state is checkpointed into git. AgentXchain now has a first-class checkpoint model for that boundary:

  • agentxchain checkpoint-turn --turn <id> commits exactly the accepted turn's declared files_changed
  • agentxchain accept-turn --checkpoint accepts the turn and immediately attempts the checkpoint
  • agentxchain run --continuous enables --auto-checkpoint by default so role handoffs do not stop on manual repo commits

If the next authoritative assignment refuses with:

Accepted turn <id> is not checkpointed yet. Run agentxchain checkpoint-turn --turn <id> before assigning the next code-writing turn.

that means the repo's only uncommitted changes are the accepted turn's uncheckpointed files. Checkpoint that turn first. Do not paper over it with git add -A; that risks mixing unrelated dirt into the acceptance boundary.

If accept-turn --checkpoint fails after acceptance, the accepted turn is preserved and the recovery path is explicit:

agentxchain checkpoint-turn --turn <id>

Command Reference

The recovery surface uses these existing commands — no dedicated recover command is needed because every blocked state maps to a specific, targeted command:

| Command | Recovery Role |
| --- | --- |
| step --resume | Resume a blocked non-manual retained turn when you want dispatch plus waiting in one command |
| resume | Resume a blocked run: re-dispatches retained turns without waiting or assigns the next turn (budget, escalation, any blocked state) |
| unblock | Resolve the current human escalation record and continue the run through the correct resume path |
| resume --role | Re-assign a turn after cleared block or hook failure during assignment |
| approve-transition | Approve a phase gate |
| approve-completion | Approve run completion |
| reject-turn | Reject a failed turn and trigger retry or reassignment |
| reissue-turn | Invalidate a stale turn and reissue against current state (baseline, runtime, or authority drift) |
| reconcile-state --accept-operator-head | Accept safe fast-forward operator commits as the new governed baseline |
| accept-turn | Manually accept a turn after validation failure and manual fix |
| checkpoint-turn | Commit the latest accepted authoritative turn so the next writable role starts from a clean baseline |
| escalate | Raise an operator escalation on an active or blocked turn |
| mission plan launch --retry | Retry one failed repo-local turn inside a coordinator workstream |
| status | View current blocked state and recovery action |

Recovery Descriptor Contract

The deriveRecoveryDescriptor() function in blocked-state.js is the canonical map from governed state to recovery action. Every blocked state MUST be registered in this function. The typed reasons are:

  • pending_run_completion
  • pending_phase_transition
  • needs_human
  • dispatch_error
  • operator_escalation
  • retries_exhausted
  • hook_block
  • hook_tamper
  • conflict_loop
  • budget_exhausted
  • policy_escalation
  • timeout
  • gate_action_failed
  • unknown_block

The unknown_block fallback ensures that even unrecognized blocked_on values produce a recovery action (manual inspection).
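
The fallback behaviour can be illustrated with a lookup that never returns an empty result. This is not the code in blocked-state.js: it covers only the reasons whose recovery command is fixed in the recovery map, and it assumes, for simplicity, that a non-human blocked_on value equals its typed reason.

```python
# Subset of typed reasons whose recovery command does not depend on turn state.
STATIC_RECOVERY_ACTIONS = {
    "pending_phase_transition": "agentxchain approve-transition",
    "pending_run_completion": "agentxchain approve-completion",
    "dispatch_error": "agentxchain step --resume",
    "needs_human": "agentxchain unblock <id>",
    "budget_exhausted": "agentxchain resume",
}
UNKNOWN_ACTION = (
    "Inspect .agentxchain/state.json, resolve manually, then agentxchain step"
)


def derive_recovery(blocked_on: str) -> dict:
    """Map a blocked_on value to a descriptor, falling back to unknown_block."""
    typed_reason = "needs_human" if blocked_on.startswith("human:") else blocked_on
    if typed_reason not in STATIC_RECOVERY_ACTIONS:
        typed_reason = "unknown_block"  # unrecognized values still get an action
    action = STATIC_RECOVERY_ACTIONS.get(typed_reason, UNKNOWN_ACTION)
    return {"typed_reason": typed_reason, "owner": "human", "recovery_action": action}


print(derive_recovery("human:rotate the API key"))
print(derive_recovery("something_new"))
```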

Coordinator-Level Recovery

When repo-local governed runs are orchestrated through a multi-repo coordinator (via coordinator-bound mission plans), failures surface at both the repo level and the coordinator workstream level. The recovery model is layered:

Targeted coordinator retry

If a repo-local dispatch inside a coordinator workstream fails with a retryable state (failed or failed_acceptance), the operator can retry directly from the mission surface:

agentxchain mission plan launch latest --workstream <id> --retry

This reissues the failed repo-local turn from current HEAD, rewrites the dispatch bundle, appends retry metadata to the coordinator launch record, and executes the retried turn immediately. The original failed dispatch is preserved for audit with retried_at and retry_reason fields.

If the repo-local retry succeeds but the coordinator cannot append the matching acceptance_projection immediately, the command still exits as a retry success. In that case the JSON payload includes warnings[], warning code coordinator_acceptance_projection_incomplete, and reconciliation_required: true. That means the repo-local retry already succeeded, but the coordinator view still needs a sync before you treat downstream work as unblocked.

Inspect the durable warning with:

agentxchain events --type coordinator_retry_projection_warning
agentxchain mission plan show latest --json

The first command shows the persisted warning event after the terminal output is gone. The second command forces plan synchronization so you can confirm whether the accepted retry has now projected cleanly.

Safety guards — retry fails closed when:

  • The workstream is not in needs_attention
  • Multiple repo failures exist in the same workstream (only single-failure retry is supported)
  • The failed repo-local turn is no longer active
  • The failure state is non-retryable (conflicted, retrying, needs_human)
  • A dependent workstream has already dispatched since the failed repo turn
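
The fail-closed guards above amount to a sequence of checks, each of which refuses the retry. A minimal sketch, with a hypothetical function name, parameters, and refusal messages:

```python
from typing import List, Optional

RETRYABLE_STATES = {"failed", "failed_acceptance"}


def retry_refusal_reason(
    workstream_status: str,
    failed_repo_states: List[str],
    failed_turn_active: bool,
    dependent_dispatched: bool,
) -> Optional[str]:
    """Return why a coordinator retry is refused, or None if it may proceed."""
    if workstream_status != "needs_attention":
        return "workstream is not in needs_attention"
    if len(failed_repo_states) != 1:
        return "exactly one failed repo is required for retry"
    if not failed_turn_active:
        return "failed repo-local turn is no longer active"
    if failed_repo_states[0] not in RETRYABLE_STATES:
        return f"failure state {failed_repo_states[0]!r} is non-retryable"
    if dependent_dispatched:
        return "a dependent workstream has already dispatched"
    return None


print(retry_refusal_reason("needs_attention", ["failed"], True, False))
print(retry_refusal_reason("needs_attention", ["conflicted"], True, False))
```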

Coordinator autopilot auto-retry

If you want the coordinator to spend a bounded retry budget during unattended waves, use:

agentxchain mission plan autopilot latest --auto-retry --max-retries 1

This does not create a second retry engine. Autopilot reuses the same narrow coordinator retry path as mission plan launch --workstream <id> --retry:

  • one retryable repo failure at a time
  • same downstream-dispatch safety guard
  • same coordinator_acceptance_projection_incomplete warning contract
  • per-session retry budget only (--max-retries resets when you rerun autopilot)

If the retry budget is exhausted or the retry attempt fails again, autopilot falls back to the normal coordinator failure boundary (failure_stopped or plan_incomplete). It does not loop forever and it does not invent broader cross-repo rollback semantics.

Repo-local recovery fallback

When coordinator --retry cannot be used (non-retryable state, multi-failure, or dependent dispatch conflict), fall back to repo-local recovery in the affected child repo:

  1. Inspect the failure: agentxchain status / agentxchain doctor in the child repo
  2. Apply the appropriate repo-local fix: reissue-turn, reject-turn, or step --resume
  3. Return to the mission surface and resume with mission plan launch --workstream <id> or mission plan autopilot --continue-on-failure

Dashboard visibility

The dashboard GET /api/plans endpoint exposes repo_dispatches[] with retry metadata (is_retry, retry_of, retried_at, retry_reason) for coordinator launch records, so operators can see retry history without inspecting raw plan artifacts.

See Missions — Recovering a failed coordinator workstream for the full operator walkthrough.

Auditable Recovery

Every recovery action is recorded in the decision ledger (.agentxchain/decision-ledger.jsonl):

  • Escalation raised: decision = "operator_escalated"
  • Escalation resolved: decision = "escalation_resolved"
  • Run reactivated: decision = "escalation_resolved" with via field
  • Phase approved: decision = "phase_transition_approved"
  • Policy escalation: decision = "policy_escalation" with violating policy details
  • Timeout: type = "timeout" with scope, limit, and elapsed time
  • Run completed: decision = "run_completion_approved"

This ensures that recovery actions are part of the governed audit trail, not silent state mutations.
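
Since the ledger is JSONL, a short script can summarize recovery activity for an audit. The sketch below reads lines that have already been loaded; the `decision` and `type` keys follow the list above, while the rest of each record's shape and the sample data are assumptions.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_ledger(lines: Iterable[str]) -> Counter:
    """Count ledger entries by decision (or by type when no decision is set)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        label = entry.get("decision") or entry.get("type") or "unlabelled"
        counts[label] += 1
    return counts


# Hypothetical sample entries; real records carry additional fields.
sample_ledger = [
    '{"decision": "operator_escalated"}',
    '{"decision": "escalation_resolved", "via": "resume"}',
    '{"type": "timeout", "scope": "phase"}',
    '{"decision": "phase_transition_approved"}',
]
print(summarize_ledger(sample_ledger))
```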