chore: checkpoint Boss app v2.5.11
docs/superpowers/plans/2026-06-06-boss-edge-reliability.md
@@ -0,0 +1,67 @@
# Boss Edge Reliability Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add the first production reliability shell for Boss task execution without changing the deployment topology.

**Architecture:** Keep Boss Cloud and the current `local-agent`, but make `local-agent` behave like a lightweight Boss Edge by adding a durable outbox and explicit task phases. Cloud-side task APIs keep leases and add watchdog cleanup so app progress never stays ambiguous indefinitely.

**Tech Stack:** Next.js API routes, file-backed Boss state, Node local-agent, Codex App Server runner, Node test runner.

---

### Task 1: Task Phase Contract

**Files:**
- Modify: `src/lib/boss-data.ts`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Add `MasterAgentTaskPhase` and normalized fields on `MasterAgentTask`: `phase`, `lastProgressAt`, `lastErrorCode`, `recoverable`, `nextRetryAt`.
- [ ] Update task normalization so old state files default to `queued -> queued` and `running -> claimed`, and terminal states preserve their terminal phase.
- [ ] Update execution progress card generation to derive step status from phase when available.
- [ ] Test that `executor_starting`, `turn_started`, `awaiting_reply`, `completing`, and `recoverable_failed` map to visible progress steps.

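
The normalization step in Task 1 could look roughly like the sketch below. It is only an illustration: the function name `normalizeTask`, the `status` field, the list of terminal statuses, and the `null`/`false` defaults are assumptions, not taken from `src/lib/boss-data.ts`. Only the `queued -> queued` and `running -> claimed` defaults and the five new field names come from the plan.

```javascript
// Assumed set of terminal statuses; the plan does not enumerate them.
const TERMINAL_STATUSES = new Set(["completed", "failed", "timed_out"]);

// Hypothetical normalizer: fills in the new fields for tasks loaded from
// state files that were written before the phase contract existed.
function normalizeTask(raw) {
  let phase = raw.phase;
  if (!phase) {
    if (raw.status === "running") {
      phase = "claimed"; // running -> claimed
    } else if (TERMINAL_STATUSES.has(raw.status)) {
      phase = raw.status; // terminal states keep their terminal phase
    } else {
      phase = "queued"; // queued -> queued (also the fallback)
    }
  }
  return {
    ...raw,
    phase,
    lastProgressAt: raw.lastProgressAt ?? null,
    lastErrorCode: raw.lastErrorCode ?? null,
    recoverable: raw.recoverable ?? false,
    nextRetryAt: raw.nextRetryAt ?? null,
  };
}
```

Tasks that already carry a `phase` pass through unchanged apart from the defaulted fields, so the normalizer is safe to run on every load.
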
### Task 2: Local Agent Durable Outbox

**Files:**
- Create: `local-agent/reliable-outbox.mjs`
- Modify: `local-agent/server.mjs`
- Test: `local-agent/reliable-outbox.test.mjs`

- [ ] Implement JSONL-backed outbox with append, list pending, mark sent, and compaction.
- [ ] Wrap `postMasterAgentTaskProgress`, `completeMasterAgentTask`, and `postAppLog` so payloads are persisted before network send.
- [ ] Replay pending records on startup and every heartbeat loop.
- [ ] Preserve idempotency keys using `taskId + event kind + phase + createdAt`.

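
One possible shape for the outbox in Task 2 is an append-only JSONL log in which each line is either a `record` or a `sent` marker, and pending records are whatever has no matching marker. The sketch below is a hedged illustration, not the contents of `local-agent/reliable-outbox.mjs`: the names `createOutbox`, `createMemoryStore`, and `idempotencyKey`, the line schema, and the `:` separator are all assumptions. To keep the example free of file I/O, the JSONL text lives in an in-memory store; a file-backed store would append lines to disk and rewrite the file atomically on compaction.

```javascript
// Hypothetical in-memory stand-in for the JSONL file.
function createMemoryStore() {
  let text = "";
  return {
    read: () => text,
    append: (line) => { text += line + "\n"; },
    replace: (next) => { text = next; },
  };
}

// Key built from the four parts named in the plan; the ":" separator is assumed.
function idempotencyKey({ taskId, kind, phase, createdAt }) {
  return [taskId, kind, phase, createdAt].join(":");
}

function createOutbox(store) {
  // Fold the log into the set of records that have no "sent" marker yet.
  function listPending() {
    const pending = new Map();
    for (const line of store.read().split("\n")) {
      if (!line.trim()) continue;
      let entry;
      try {
        entry = JSON.parse(line);
      } catch {
        continue; // skip a torn or partially written line
      }
      if (entry.type === "record") pending.set(entry.key, entry);
      else if (entry.type === "sent") pending.delete(entry.key);
    }
    return [...pending.values()];
  }

  return {
    // Persist the payload first; the caller sends it over the network afterwards.
    append(key, payload) {
      store.append(JSON.stringify({ type: "record", key, payload }));
    },
    markSent(key) {
      store.append(JSON.stringify({ type: "sent", key }));
    },
    listPending,
    // Rewrite the log so it contains only records that are still pending.
    compact() {
      const lines = listPending().map((entry) => JSON.stringify(entry));
      store.replace(lines.length ? lines.join("\n") + "\n" : "");
    },
  };
}
```

A replay loop under this design would call `listPending()` on startup and on each heartbeat, resend each payload with its key, and call `markSent(key)` only after the cloud acknowledges it. A crash between send and mark then produces a duplicate send with the same key rather than a lost event.
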
### Task 3: Cloud Watchdog

**Files:**
- Modify: `src/lib/boss-data.ts`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Add a lightweight watchdog function invoked during claim, progress, complete, and heartbeat-derived writes.
- [ ] Expire stale user conversation tasks older than 1 hour while still queued.
- [ ] Convert stale running tasks without progress into `recoverable_failed` if the turn has not started, otherwise `timed_out`.
- [ ] Ensure a late complete call cannot overwrite a terminal state.

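
The watchdog rules in Task 3 can be sketched as two pure functions over a task record. This is a minimal illustration under stated assumptions: the function names, the `expired` phase, the five-minute staleness threshold, the choice of which phases count as "turn not started", and the use of millisecond timestamps are hypothetical. Only the one-hour queued limit and the `recoverable_failed` / `timed_out` split come from the plan. The sketch also does not filter by task type, whereas the plan limits expiry to user conversation tasks.

```javascript
const QUEUED_TTL_MS = 60 * 60 * 1000;   // 1 hour, from the plan
const STALE_RUNNING_MS = 5 * 60 * 1000; // assumed threshold, not from the plan

// Assumed phase groupings.
const TERMINAL_PHASES = new Set(["completed", "failed", "timed_out", "expired"]);
const PRE_TURN_PHASES = new Set(["claimed", "executor_starting"]);

// Returns the task unchanged, or a copy moved to a watchdog-assigned phase.
function sweepTask(task, now) {
  if (TERMINAL_PHASES.has(task.phase)) return task;
  if (task.phase === "recoverable_failed") return task; // already waiting for retry
  if (task.phase === "queued") {
    return now - task.createdAt > QUEUED_TTL_MS ? { ...task, phase: "expired" } : task;
  }
  const lastSeen = task.lastProgressAt ?? task.createdAt;
  if (now - lastSeen <= STALE_RUNNING_MS) return task;
  return PRE_TURN_PHASES.has(task.phase)
    ? { ...task, phase: "recoverable_failed", recoverable: true }
    : { ...task, phase: "timed_out", recoverable: false };
}

// A late completion must not overwrite a terminal state.
function applyComplete(task, result) {
  if (TERMINAL_PHASES.has(task.phase)) return task;
  return { ...task, phase: "completed", result };
}
```

Because both functions are pure, the same sweep can be invoked from claim, progress, complete, and heartbeat-derived writes without a separate timer.
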
### Task 4: Executor Health Grading

**Files:**
- Modify: `src/lib/boss-data.ts`
- Modify: `local-agent/codex-app-server-runner.mjs`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Derive `codexAppServerHealth` as `available / degraded / unavailable` from heartbeat metadata and recent errors.
- [ ] Allow GUI-preferred task claim only when health is not `unavailable`.
- [ ] Mark app-server stdio closed and timeout errors as degraded for the next heartbeat.

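
The health grading in Task 4 might be derived along the following lines. This is a sketch only: the heartbeat shape (`receivedAt`, `appServerRunning`, `recentErrors`), the error codes `stdio_closed` and `timeout`, and both time windows are assumptions made for illustration. The plan specifies only the three grades and the claim rule.

```javascript
// Hypothetical error codes for the two failure kinds named in the plan.
const DEGRADING_ERROR_CODES = new Set(["stdio_closed", "timeout"]);

// Grades the Codex App Server from the latest heartbeat (assumed shape).
function deriveCodexAppServerHealth(heartbeat, now, options = {}) {
  const staleAfterMs = options.staleAfterMs ?? 90 * 1000;       // assumed
  const errorWindowMs = options.errorWindowMs ?? 5 * 60 * 1000; // assumed

  // No heartbeat, or one that is too old, means the executor cannot be trusted.
  if (!heartbeat || now - heartbeat.receivedAt > staleAfterMs) return "unavailable";
  if (heartbeat.appServerRunning === false) return "unavailable";

  // A recent stdio-closed or timeout error downgrades an otherwise live executor.
  const recentErrors = (heartbeat.recentErrors ?? []).filter(
    (error) => now - error.at <= errorWindowMs
  );
  const degraded = recentErrors.some((error) => DEGRADING_ERROR_CODES.has(error.code));
  return degraded ? "degraded" : "available";
}

// GUI-preferred tasks may be claimed unless the executor is unavailable.
function canClaimGuiPreferredTask(health) {
  return health !== "unavailable";
}
```

Treating `degraded` as claimable keeps work flowing through transient app-server errors, while a missing or stale heartbeat blocks GUI-preferred claims outright.
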
### Task 5: Verification

**Files:**
- Modify: `docs/architecture/current_runtime_and_deploy_status_cn.md`

- [ ] Run `node --test local-agent/reliable-outbox.test.mjs local-agent/master-task-timeout.test.mjs`.
- [ ] Run `npx eslint src/lib/boss-data.ts local-agent/server.mjs local-agent/codex-app-server-runner.mjs local-agent/reliable-outbox.mjs`.
- [ ] Run `npm run build`.
- [ ] Run `npm run lint`.
- [ ] Document the B+ reliability shell and the local Edge direction in the runtime status doc.