AgentXchain v2.102.0

v2.102.0 ships a named benchmark workload catalog with topology-aware discovery and regression-triggering proof.

Released: 2026-04-15

Named Benchmark Workload Catalog

agentxchain benchmark now uses a named workload catalog instead of boolean mode flags. Each workload declares its own phase topology, expected governance signals, and proof expectations.

Built-in Workloads

| Workload | Phases | Purpose |
|---|---|---|
| baseline | planning → implementation → qa | Standard governed run: happy-path proof |
| stress | planning → implementation → qa | Admission control with validator rejection and retry |
| completion-recovery | planning → implementation → qa | Gate failure on missing artifact, reassignment, and recovery |
| phase-drift | planning → design → implementation → qa | Extra phase topology; triggers REG-PHASE-ORDER-001 when diffed against baseline |
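To make the catalog idea concrete, here is a minimal sketch of the built-in workloads as plain data, where each entry carries its own phase topology and phase_count is derived from phase_order. The Workload class, its fields, and the CATALOG dict are hypothetical illustrations, not AgentXchain's actual internal types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One catalog entry. Class and field names are illustrative only."""

    name: str
    phase_order: tuple[str, ...]
    purpose: str

    @property
    def phase_count(self) -> int:
        # Derived from the declared topology instead of being stored twice.
        return len(self.phase_order)


STANDARD = ("planning", "implementation", "qa")

# The four built-in workloads, keyed by name so selection is a dict lookup.
CATALOG: dict[str, Workload] = {
    w.name: w
    for w in (
        Workload("baseline", STANDARD, "Standard governed run: happy-path proof"),
        Workload("stress", STANDARD, "Admission control with validator rejection and retry"),
        Workload("completion-recovery", STANDARD,
                 "Gate failure on missing artifact, reassignment, and recovery"),
        Workload("phase-drift", ("planning", "design", "implementation", "qa"),
                 "Extra phase topology; diffs against baseline as REG-PHASE-ORDER-001"),
    )
}

if __name__ == "__main__":
    for w in CATALOG.values():
        print(f"{w.name}: {' -> '.join(w.phase_order)} ({w.phase_count} phases)")
```

Keeping topology on the entry itself is what lets a new workload declare a different phase order without touching the command that runs it.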

Workload Selection

```bash
# Run a specific workload
agentxchain benchmark --workload completion-recovery

# Legacy alias still works
agentxchain benchmark --stress
# equivalent to: agentxchain benchmark --workload stress

# Conflicting flags fail closed
agentxchain benchmark --stress --workload baseline # → error
```
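The alias and fail-closed rules for workload selection can be sketched as a small resolver. This is an illustration, not the CLI's code: resolve_workload and WorkloadFlagError are hypothetical names, the "baseline" default is an assumption (the notes do not state a default), and accepting --stress together with --workload stress is also an assumption.

```python
from typing import Optional


class WorkloadFlagError(ValueError):
    """Hypothetical error raised when legacy and named flags disagree."""


DEFAULT_WORKLOAD = "baseline"  # assumption: the real default is not documented here


def resolve_workload(workload: Optional[str] = None, *, stress: bool = False) -> str:
    """Map CLI flags to exactly one workload name, failing closed on conflicts."""
    legacy = "stress" if stress else None  # --stress is an alias for --workload stress
    if legacy and workload and workload != legacy:
        # Two flags name different workloads: refuse rather than guess.
        raise WorkloadFlagError(f"--stress conflicts with --workload {workload}")
    return workload or legacy or DEFAULT_WORKLOAD


if __name__ == "__main__":
    print(resolve_workload("completion-recovery"))
    print(resolve_workload(stress=True))
```

Raising instead of letting one flag silently win is the "fail closed" choice: an ambiguous invocation never produces a benchmark of the wrong workload.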

Workload Discovery

A new benchmark workloads subcommand exposes the full catalog with its topology metadata:

```bash
# Human-readable listing
agentxchain benchmark workloads

# Structured catalog with phase order, expected signals, descriptions
agentxchain benchmark workloads --json
```

The JSON output includes phase_order, phase_count, and the expected governance signals for each workload, so operators no longer need to read source code or scrape --help output to learn what each workload proves.
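A script can consume the structured catalog instead of scraping help text. The sketch below assumes the JSON is an object with a top-level "workloads" array; only phase_order and phase_count are named in these notes, so the "workloads" and "name" keys are assumptions to verify against real output. workloads_with_phase is a hypothetical helper.

```python
import json

# Assumed shape of `agentxchain benchmark workloads --json`. Only phase_order
# and phase_count are documented; the other keys here are hypothetical.
SAMPLE = """
{"workloads": [
  {"name": "baseline", "phase_order": ["planning", "implementation", "qa"], "phase_count": 3},
  {"name": "phase-drift", "phase_order": ["planning", "design", "implementation", "qa"], "phase_count": 4}
]}
"""


def workloads_with_phase(catalog_json: str, phase: str) -> list[str]:
    """Return the names of workloads whose topology includes `phase`."""
    catalog = json.loads(catalog_json)
    names = []
    for entry in catalog["workloads"]:
        # Cross-check the declared count against the declared order.
        if entry["phase_count"] != len(entry["phase_order"]):
            raise ValueError(f"inconsistent topology for {entry['name']}")
        if phase in entry["phase_order"]:
            names.append(entry["name"])
    return names


if __name__ == "__main__":
    print(workloads_with_phase(SAMPLE, "design"))
```

Against real output, replace SAMPLE with the command's stdout and adjust the key names to match.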

Phase-Drift Regression Proof

The phase-drift workload proves that the benchmark can trigger its own regression engine, not just demonstrate regression-free comparisons:

```bash
# Run both workloads with saved artifacts
agentxchain benchmark --workload baseline --output /tmp/bench-baseline
agentxchain benchmark --workload phase-drift --output /tmp/bench-drift

# Diff triggers real regression detection
agentxchain verify diff /tmp/bench-baseline/run-export.json /tmp/bench-drift/run-export.json
# → exit 1, has_regressions: true, REG-PHASE-ORDER-001
```

This exercises a different diff shape from the regression-free comparisons: baseline's phase order is ["planning", "implementation", "qa"], while phase-drift's is ["planning", "design", "implementation", "qa"].

The saved benchmark artifacts are still repo-local run exports, not coordinator exports. This release proves workload and phase-regression behavior through verify diff; it does not exercise coordinator repo-status logic.
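The phase-order check can be approximated as a direct comparison of the two phase_order lists. This is a simplified sketch, not verify diff's implementation: diff_phase_order, exit_code, and the record's "added"/"removed" fields are hypothetical. Only the REG-PHASE-ORDER-001 ID and the exit-1-on-regression behavior come from the notes.

```python
from collections.abc import Sequence
from typing import Optional

# The ID comes from the release notes; the comparison rule is an approximation.
PHASE_ORDER_REGRESSION = "REG-PHASE-ORDER-001"


def diff_phase_order(left: Sequence[str], right: Sequence[str]) -> Optional[dict]:
    """Return a regression record if the phase orders differ, else None."""
    if list(left) == list(right):
        return None
    return {
        "id": PHASE_ORDER_REGRESSION,
        "added": [p for p in right if p not in left],
        "removed": [p for p in left if p not in right],
    }


def exit_code(regressions: list) -> int:
    # Mirrors the documented CLI behavior: exit 1 when any regression is found.
    return 1 if regressions else 0


if __name__ == "__main__":
    baseline = ["planning", "implementation", "qa"]
    drift = ["planning", "design", "implementation", "qa"]
    found = [r for r in (diff_phase_order(baseline, drift),) if r is not None]
    print(found, "exit", exit_code(found))
```

Comparing whole lists, rather than only counting phases, also flags a pure reordering where no phase is added or removed.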

Completion-Recovery Workload

The completion-recovery workload proves gate-failure recovery, a different failure class from the validator rejection exercised by the stress workload:

  1. QA attempts completion
  2. Gate check fails (missing .planning/ship-verdict.md)
  3. QA is reassigned
  4. Missing artifact is created
  5. Completion succeeds through the real approval path
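The five recovery steps can be walked through against a scratch directory. In this minimal sketch, the completion gate only checks that .planning/ship-verdict.md exists. The Run class, the recover function, and the event names are hypothetical, and the approval path in step 5 is reduced to a successful retry.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

REQUIRED_ARTIFACT = ".planning/ship-verdict.md"


@dataclass
class Run:
    """Hypothetical run state; real governed runs track far more than this."""

    root: Path
    owner: str = "qa"
    events: list[str] = field(default_factory=list)

    def gate_passes(self) -> bool:
        # Simplified gate: completion requires only the ship-verdict file.
        return (self.root / REQUIRED_ARTIFACT).is_file()

    def attempt_completion(self) -> bool:
        self.events.append("completion_attempted")
        if not self.gate_passes():
            self.events.append("gate_failed")
            return False
        self.events.append("completed")
        return True


def recover(run: Run) -> list[str]:
    """Attempt completion, and on gate failure reassign, fix, and retry."""
    if run.attempt_completion():                  # steps 1-2: attempt, gate check
        return run.events
    run.events.append(f"reassigned:{run.owner}")  # step 3: hand the turn back to QA
    artifact = run.root / REQUIRED_ARTIFACT       # step 4: create the missing file
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text("# Ship verdict\n", encoding="utf-8")
    run.attempt_completion()                      # step 5: retry (approval not modelled)
    return run.events


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(recover(Run(Path(tmp))))
```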

Durable Benchmark Artifacts

The --output <dir> flag persists proof artifacts that round-trip through the public verification surface:

  • metrics.json — timing and governance signal counts
  • run-export.json — full run export (verifiable via agentxchain verify export)
  • verify-export.json — verification report
  • workload.json — workload metadata including topology
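After a run with --output, a quick local check can confirm that all four artifacts exist and parse as JSON before they are handed to agentxchain verify export or verify diff. This sketch assumes nothing about the files' internal fields. check_benchmark_output is a hypothetical helper, not part of the CLI.

```python
import json
from pathlib import Path

# File names come from the release notes; their contents are not assumed.
EXPECTED_ARTIFACTS = ("metrics.json", "run-export.json", "verify-export.json", "workload.json")


def check_benchmark_output(output_dir: str) -> dict[str, bool]:
    """Report, per expected artifact, whether it exists and parses as JSON."""
    root = Path(output_dir)
    status = {}
    for name in EXPECTED_ARTIFACTS:
        try:
            json.loads((root / name).read_text(encoding="utf-8"))
            status[name] = True
        except (OSError, json.JSONDecodeError):
            # Missing, unreadable, or malformed artifacts all count as failures.
            status[name] = False
    return status


if __name__ == "__main__":
    print(check_benchmark_output("/tmp/bench-baseline"))
```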

If operators later compare coordinator exports outside the benchmark flow, verify diff applies the same truth boundary: summary.repo_run_statuses remains raw coordinator snapshot metadata, while coordinator repo-status changes and regressions are derived from authority-first child repo status whenever the nested child exports are readable.

Topology-Aware Config Generation

Benchmark config generation now resolves from workload-declared phase specs instead of mutating a hardcoded base config. The makeConfig() builder derives roles, runtimes, routing, and gates from each workload's phase_order and custom_phases declarations, so new topologies can be added without per-workload branches in the command body.
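The derivation idea can be sketched in a few lines: merge default phase specs with a workload's custom phases, then compute roles, runtimes, routing, and gates from the phase order alone. This is not makeConfig() itself, whose schema these notes do not document. make_config, the role names (pm, dev, qa, architect), the "local" runtime label, and the "<phase>_exit" gate naming are all hypothetical.

```python
from typing import Any, Optional

# Hypothetical defaults; the real phase-spec schema is not documented here.
DEFAULT_PHASE_SPECS: dict[str, dict[str, Any]] = {
    "planning": {"role": "pm"},
    "implementation": {"role": "dev"},
    "qa": {"role": "qa"},
}


def make_config(phase_order: list[str],
                custom_phases: Optional[dict[str, dict[str, Any]]] = None) -> dict[str, Any]:
    """Derive a governed config purely from the declared topology."""
    specs = {**DEFAULT_PHASE_SPECS, **(custom_phases or {})}
    missing = [p for p in phase_order if p not in specs]
    if missing:
        raise ValueError(f"no phase spec for: {missing}")
    roles = sorted({specs[p]["role"] for p in phase_order})
    routing = {}
    for i, phase in enumerate(phase_order):
        # Each phase routes to its role and hands off to the next phase, if any.
        nxt = phase_order[i + 1] if i + 1 < len(phase_order) else None
        routing[phase] = {"role": specs[phase]["role"], "next": nxt}
    return {
        "roles": roles,
        "runtimes": {role: "local" for role in roles},
        "routing": routing,
        "gates": {f"{phase}_exit": {"phase": phase} for phase in phase_order},
    }


if __name__ == "__main__":
    print(make_config(["planning", "design", "implementation", "qa"],
                      {"design": {"role": "architect"}}))
```

Because nothing in the function names a specific workload, the phase-drift topology needs only an extra custom_phases entry rather than a new branch.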

Admission Control Cleanup

Pre-run admission control now rejects dead-end governed configs (configurations in which no agent can make progress) before a run starts, saving operator time. The legacy collectRemoteReviewOnlyGateWarnings helper was removed; its validation responsibility now belongs to the admission control system.
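A dead-end check can be sketched as a per-phase question: is there a configured role, with a runtime, that can act in this phase? The notes do not spell out the real admission criteria. dead_end_phases, admit, and the config keys used here are hypothetical, and the single rule shown is a simplification.

```python
from typing import Any


def dead_end_phases(config: dict[str, Any]) -> list[str]:
    """Return the phases in which no configured agent could take a turn.

    Hypothetical rule: a phase is a dead end when its routed role is missing
    from `roles` or has no runtime bound to it.
    """
    roles = set(config.get("roles", []))
    runtimes = config.get("runtimes", {})
    stuck = []
    for phase in config["phase_order"]:
        role = config.get("routing", {}).get(phase)
        if role not in roles or role not in runtimes:
            stuck.append(phase)
    return stuck


def admit(config: dict[str, Any]) -> None:
    """Reject a config before the run starts instead of letting it stall mid-run."""
    stuck = dead_end_phases(config)
    if stuck:
        raise ValueError(f"dead-end config; no agent can act in: {', '.join(stuck)}")


if __name__ == "__main__":
    admit({"phase_order": ["planning", "qa"], "roles": ["pm", "qa"],
           "routing": {"planning": "pm", "qa": "qa"},
           "runtimes": {"pm": "local", "qa": "local"}})
    print("admitted")
```

Checking before the first turn is cheaper than discovering the stall after earlier phases have already consumed agent time.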

Evidence

  • 4675 tests / 1000 suites / 0 failures
  • 20 benchmark tests (AT-BENCH-001 through AT-BENCH-020)
  • 7 docs guard tests
  • Docusaurus build: success