boss/docs/superpowers/plans/2026-06-06-boss-edge-reliability.md
2026-06-08 12:22:50 +08:00

Boss Edge Reliability Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add the first production reliability shell for Boss task execution without changing the deployment topology.

Architecture: Keep Boss Cloud and the current local-agent, but make local-agent behave like a lightweight Boss Edge by adding a durable outbox and explicit task phases. Cloud-side task APIs keep their leases and gain watchdog cleanup, so task progress shown in the app never stays ambiguous indefinitely.

Tech Stack: Next.js API routes, file-backed Boss state, Node local-agent, Codex App Server runner, Node test runner.


Task 1: Task Phase Contract

Files:

  • Modify: src/lib/boss-data.ts

  • Test: src/lib/boss-data-reliability.test.ts

  • Add MasterAgentTaskPhase and normalized fields on MasterAgentTask: phase, lastProgressAt, lastErrorCode, recoverable, nextRetryAt.

  • Update task normalization so old state files without a phase get a default: status queued maps to phase queued, status running maps to phase claimed, and terminal statuses keep their terminal phase.

  • Update execution progress card generation to derive step status from phase when available.

  • Test that executor_starting, turn_started, awaiting_reply, completing, and recoverable_failed map to visible progress steps.
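
The phase contract and its defaults can be sketched as below. This is a minimal illustration, not the contents of src/lib/boss-data.ts: the phase names queued, claimed, executor_starting, turn_started, awaiting_reply, completing, recoverable_failed, and timed_out come from this plan, while the terminal names completed and failed, the StoredTask shape, the fallback of lastProgressAt to updatedAt, and the helper names are assumptions made for the example.

```typescript
// Phase names come from the plan; "completed" and "failed" are assumed terminal names.
type MasterAgentTaskPhase =
  | "queued"
  | "claimed"
  | "executor_starting"
  | "turn_started"
  | "awaiting_reply"
  | "completing"
  | "recoverable_failed"
  | "completed"
  | "failed"
  | "timed_out";

// Hypothetical legacy status values found in old state files.
type LegacyTaskStatus = "queued" | "running" | "completed" | "failed" | "timed_out";

// Hypothetical on-disk shape; only the five optional fields are named in the plan.
interface StoredTask {
  id: string;
  status: LegacyTaskStatus;
  updatedAt: string;
  phase?: MasterAgentTaskPhase;
  lastProgressAt?: string;
  lastErrorCode?: string | null;
  recoverable?: boolean;
  nextRetryAt?: string | null;
}

type NormalizedTask = Required<StoredTask>;

function defaultPhaseForStatus(status: LegacyTaskStatus): MasterAgentTaskPhase {
  if (status === "queued") return "queued";
  if (status === "running") return "claimed";
  return status; // terminal statuses keep their terminal phase
}

// Fill in the new fields so every task read from disk has the full contract.
function normalizeTask(task: StoredTask): NormalizedTask {
  const phase = task.phase ?? defaultPhaseForStatus(task.status);
  return {
    ...task,
    phase,
    lastProgressAt: task.lastProgressAt ?? task.updatedAt, // assumed fallback
    lastErrorCode: task.lastErrorCode ?? null,
    recoverable: task.recoverable ?? (phase === "recoverable_failed"),
    nextRetryAt: task.nextRetryAt ?? null,
  };
}

type StepStatus = "pending" | "active" | "done" | "failed";

interface ProgressStep {
  phase: MasterAgentTaskPhase;
  status: StepStatus;
}

// Assumed ordering of the non-failure phases for the progress card.
const HAPPY_PATH: MasterAgentTaskPhase[] = [
  "queued",
  "claimed",
  "executor_starting",
  "turn_started",
  "awaiting_reply",
  "completing",
  "completed",
];

function deriveProgressSteps(phase: MasterAgentTaskPhase): ProgressStep[] {
  const current = HAPPY_PATH.indexOf(phase);
  if (current === -1) {
    // Failure phases: the phase alone does not say where the task failed,
    // so this sketch shows a single failed step.
    return [{ phase, status: "failed" }];
  }
  return HAPPY_PATH.map((step, index): ProgressStep => {
    if (phase === "completed" || index < current) return { phase: step, status: "done" };
    return { phase: step, status: index === current ? "active" : "pending" };
  });
}
```

A real progress card would likely carry labels and timestamps per step; the point here is only that step status is a pure function of phase, which keeps the Task 1 tests simple.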

Task 2: Local Agent Durable Outbox

Files:

  • Create: local-agent/reliable-outbox.mjs

  • Modify: local-agent/server.mjs

  • Test: local-agent/reliable-outbox.test.mjs

  • Implement JSONL-backed outbox with append, list pending, mark sent, and compaction.

  • Wrap postMasterAgentTaskProgress, completeMasterAgentTask, and postAppLog so payloads are persisted before network send.

  • Replay pending records on startup and every heartbeat loop.

  • Preserve idempotency by building each record's key from taskId + event kind + phase + createdAt and keeping that key unchanged across replays.
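
The outbox semantics above can be sketched as pure functions over the JSONL text. This is an illustration under stated assumptions, not the planned local-agent/reliable-outbox.mjs: the record shape, the put/sent line format, the "|" key separator, and all function names are hypothetical, and file I/O is deliberately left out (the real module would append each returned line to a file before attempting the network send).

```typescript
// Hypothetical record shape; the plan only names the four key components.
interface OutboxRecord {
  key: string; // idempotency key
  taskId: string;
  kind: string; // e.g. "progress", "complete", "app_log" (assumed names)
  phase: string;
  createdAt: string;
  payload: unknown;
}

// Assumed line format: one JSON object per line, either a new record or a sent marker.
type OutboxLine =
  | { op: "put"; record: OutboxRecord }
  | { op: "sent"; key: string };

function idempotencyKey(r: Pick<OutboxRecord, "taskId" | "kind" | "phase" | "createdAt">): string {
  return [r.taskId, r.kind, r.phase, r.createdAt].join("|");
}

// Append a record; the caller persists the result before sending anything.
function enqueue(log: string, input: Omit<OutboxRecord, "key">): string {
  const record: OutboxRecord = { ...input, key: idempotencyKey(input) };
  const line: OutboxLine = { op: "put", record };
  return log + JSON.stringify(line) + "\n";
}

// Append a marker instead of rewriting the file, so writes stay append-only.
function markSent(log: string, key: string): string {
  const line: OutboxLine = { op: "sent", key };
  return log + JSON.stringify(line) + "\n";
}

// Tolerate blank or torn lines (for example after a crash mid-append).
function parseLine(raw: string): OutboxLine | null {
  try {
    const value = JSON.parse(raw);
    return value && (value.op === "put" || value.op === "sent") ? (value as OutboxLine) : null;
  } catch {
    return null;
  }
}

function listPending(log: string): OutboxRecord[] {
  const sent = new Set<string>();
  const pending = new Map<string, OutboxRecord>();
  for (const raw of log.split("\n")) {
    const line = parseLine(raw);
    if (line === null) continue;
    if (line.op === "sent") {
      sent.add(line.key);
      pending.delete(line.key);
    } else if (!sent.has(line.record.key) && !pending.has(line.record.key)) {
      pending.set(line.record.key, line.record);
    }
  }
  return Array.from(pending.values());
}

// Compaction keeps only records that still need to be sent.
function compact(log: string): string {
  return listPending(log)
    .map((record) => JSON.stringify({ op: "put", record }) + "\n")
    .join("");
}
```

Under this design, replay on startup or on each heartbeat would iterate listPending, attempt the send, and call markSent only after the cloud acknowledges. A lost sent marker then causes a duplicate send at worst, which the stable key lets the receiver discard.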

Task 3: Cloud Watchdog

Files:

  • Modify: src/lib/boss-data.ts

  • Test: src/lib/boss-data-reliability.test.ts

  • Add a lightweight watchdog function invoked during claim, progress, complete, and heartbeat-derived writes.

  • Expire user conversation tasks that are still queued more than 1 hour after creation.

  • Convert stale running tasks that have reported no recent progress: to recoverable_failed if the turn has not started, otherwise to timed_out.

  • Ensure a late complete call cannot overwrite a task that is already in a terminal state.
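
The watchdog rules above can be sketched as a pure function applied to each task during a write. This is a minimal sketch under assumptions, not the planned code in src/lib/boss-data.ts: the 1 hour queued limit comes from this plan, but the running staleness threshold is left as a parameter because the plan does not specify it, and the expired phase name, the user_conversation kind name, the task shape, and the function names are hypothetical. The sketch also treats recoverable_failed as non-terminal, on the assumption that such tasks can be retried.

```typescript
type WatchdogPhase =
  | "queued"
  | "claimed"
  | "executor_starting"
  | "turn_started"
  | "awaiting_reply"
  | "completing"
  | "recoverable_failed"
  | "completed"
  | "failed"
  | "timed_out"
  | "expired"; // assumed name for expired queued tasks

// Hypothetical task shape with epoch-millisecond timestamps.
interface WatchedTask {
  id: string;
  kind: string; // "user_conversation" is an assumed kind name
  phase: WatchdogPhase;
  createdAt: number;
  lastProgressAt: number;
  recoverable: boolean;
}

const QUEUED_TTL_MS = 60 * 60 * 1000; // 1 hour, from the plan
const BEFORE_TURN: WatchdogPhase[] = ["claimed", "executor_starting"];
const AFTER_TURN: WatchdogPhase[] = ["turn_started", "awaiting_reply", "completing"];
const TERMINAL: WatchdogPhase[] = ["completed", "failed", "timed_out", "expired"];

// runningStaleMs is an assumed parameter; the plan does not give a value.
function sweepTask(task: WatchedTask, now: number, runningStaleMs: number): WatchedTask {
  if (task.phase === "queued") {
    const expired = task.kind === "user_conversation" && now - task.createdAt > QUEUED_TTL_MS;
    return expired ? { ...task, phase: "expired", recoverable: false } : task;
  }
  if (now - task.lastProgressAt <= runningStaleMs) return task;
  if (BEFORE_TURN.includes(task.phase)) {
    // No turn was started, so retrying is assumed to be safe.
    return { ...task, phase: "recoverable_failed", recoverable: true };
  }
  if (AFTER_TURN.includes(task.phase)) {
    return { ...task, phase: "timed_out", recoverable: false };
  }
  return task; // terminal or already failed: nothing to do
}

// A late complete is ignored once the task is terminal.
function applyComplete(task: WatchedTask, outcome: "completed" | "failed"): WatchedTask {
  if (TERMINAL.includes(task.phase)) return task;
  return { ...task, phase: outcome, recoverable: false };
}
```

Keeping the sweep free of I/O means it can run cheaply inside claim, progress, complete, and heartbeat writes, and the reliability tests can drive it with fixed timestamps.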

Task 4: Executor Health Grading

Files:

  • Modify: src/lib/boss-data.ts

  • Modify: local-agent/codex-app-server-runner.mjs

  • Test: src/lib/boss-data-reliability.test.ts

  • Derive codexAppServerHealth as available / degraded / unavailable from heartbeat metadata and recent errors.

  • Allow GUI-preferred task claim only when health is not unavailable.

  • Treat app-server "stdio closed" and timeout errors as degraded, and report that grade in the next heartbeat.
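
The grading and claim rule above can be sketched as follows. The three grades and the claim rule come from this plan; the heartbeat fields, the error code strings, the heartbeat age limit, and the function names are assumptions made for illustration only.

```typescript
type CodexAppServerHealth = "available" | "degraded" | "unavailable";

// Hypothetical heartbeat metadata; the real fields are not given in the plan.
interface HeartbeatMeta {
  receivedAt: number; // epoch ms
  appServerRunning: boolean;
  recentErrorCodes: string[];
}

// Assumed error codes standing in for "stdio closed" and timeout errors.
const DEGRADING_ERRORS = ["app_server_stdio_closed", "app_server_timeout"];

// maxHeartbeatAgeMs is an assumed parameter; the plan does not specify one.
function deriveCodexAppServerHealth(
  heartbeat: HeartbeatMeta | null,
  now: number,
  maxHeartbeatAgeMs: number,
): CodexAppServerHealth {
  if (heartbeat === null) return "unavailable";
  if (now - heartbeat.receivedAt > maxHeartbeatAgeMs) return "unavailable";
  if (!heartbeat.appServerRunning) return "unavailable";
  const hasDegradingError = heartbeat.recentErrorCodes.some((code) =>
    DEGRADING_ERRORS.includes(code),
  );
  return hasDegradingError ? "degraded" : "available";
}

// From the plan: GUI-preferred tasks may be claimed unless health is unavailable.
function canClaimGuiPreferredTask(health: CodexAppServerHealth): boolean {
  return health !== "unavailable";
}
```

With this split, a degraded executor still accepts GUI-preferred work, so a single closed stdio pipe or timeout lowers the grade without blocking the queue.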

Task 5: Verification

Files:

  • Modify: docs/architecture/current_runtime_and_deploy_status_cn.md

  • Run node --test local-agent/reliable-outbox.test.mjs local-agent/master-task-timeout.test.mjs.

  • Run npx eslint src/lib/boss-data.ts local-agent/server.mjs local-agent/codex-app-server-runner.mjs local-agent/reliable-outbox.mjs.

  • Run npm run build.

  • Run npm run lint.

  • Document the B+ reliability shell and the local Edge direction in the runtime status doc.