chore: checkpoint Boss app v2.5.11
docs/superpowers/plans/2026-06-06-boss-edge-reliability.md
@@ -0,0 +1,67 @@
# Boss Edge Reliability Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add the first production reliability shell for Boss task execution without changing the deployment topology.

**Architecture:** Keep Boss Cloud and the current `local-agent`, but make `local-agent` behave like a lightweight Boss Edge by adding a durable outbox and explicit task phases. Cloud-side task APIs keep leases and add watchdog cleanup so APP progress never stays ambiguous forever.

**Tech Stack:** Next.js API routes, file-backed Boss state, Node local-agent, Codex App Server runner, Node test runner.

---

### Task 1: Task Phase Contract

**Files:**
- Modify: `src/lib/boss-data.ts`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Add `MasterAgentTaskPhase` and normalized fields on `MasterAgentTask`: `phase`, `lastProgressAt`, `lastErrorCode`, `recoverable`, `nextRetryAt`.
- [ ] Update task normalization so old state files get default phases: `queued -> queued`, `running -> claimed`, and terminal states keep their terminal phase.
- [ ] Update execution progress card generation to derive step status from phase when available.
- [ ] Test that `executor_starting`, `turn_started`, `awaiting_reply`, `completing`, and `recoverable_failed` map to visible progress steps.
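
The normalization and progress-card items above could look roughly like the sketch below. It is not the code in `src/lib/boss-data.ts`: the `LegacyStatus` type, the `normalizePhase` helper, the `PROGRESS_STEPS` labels, and the `failed -> terminal_failed` default are assumptions made for illustration. Only the phase names and the `queued -> queued`, `running -> claimed` defaults come from this plan.

```typescript
// Phase names come from the design; every other name here is hypothetical.
type MasterAgentTaskPhase =
  | "queued" | "claimed" | "executor_starting" | "turn_started"
  | "awaiting_reply" | "completing" | "completed"
  | "recoverable_failed" | "terminal_failed" | "timed_out" | "canceled";

// Assumed coarse status stored in old state files.
type LegacyStatus = "queued" | "running" | "completed" | "failed";

type StepState = "pending" | "active" | "done" | "retryable" | "failed";

// Hypothetical progress-card labels. One entry per phase keeps the mapping
// exhaustive at compile time.
const PROGRESS_STEPS: Record<MasterAgentTaskPhase, { label: string; state: StepState }> = {
  queued: { label: "Waiting for executor", state: "pending" },
  claimed: { label: "Waiting for executor", state: "active" },
  executor_starting: { label: "Starting executor", state: "active" },
  turn_started: { label: "Codex turn started", state: "active" },
  awaiting_reply: { label: "Waiting for result", state: "active" },
  completing: { label: "Writing back result", state: "active" },
  completed: { label: "Writing back result", state: "done" },
  recoverable_failed: { label: "Executor recovering", state: "retryable" },
  terminal_failed: { label: "Failed", state: "failed" },
  timed_out: { label: "Timed out", state: "failed" },
  canceled: { label: "Canceled", state: "failed" },
};

// Defaults for records written before `phase` existed.
// `failed -> terminal_failed` is an assumption; the plan only says that
// terminal states keep a terminal phase.
const LEGACY_DEFAULT_PHASE: Record<LegacyStatus, MasterAgentTaskPhase> = {
  queued: "queued",
  running: "claimed",
  completed: "completed",
  failed: "terminal_failed",
};

function isKnownPhase(value: string | undefined): value is MasterAgentTaskPhase {
  return value !== undefined && Object.prototype.hasOwnProperty.call(PROGRESS_STEPS, value);
}

// Keep a valid stored phase; otherwise fall back to the legacy default.
function normalizePhase(status: LegacyStatus, phase?: string): MasterAgentTaskPhase {
  return isKnownPhase(phase) ? phase : LEGACY_DEFAULT_PHASE[status];
}
```

A progress card would then read `PROGRESS_STEPS[normalizePhase(task.status, task.phase)]` instead of a default step index.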

### Task 2: Local Agent Durable Outbox

**Files:**
- Create: `local-agent/reliable-outbox.mjs`
- Modify: `local-agent/server.mjs`
- Test: `local-agent/reliable-outbox.test.mjs`

- [ ] Implement JSONL-backed outbox with append, list pending, mark sent, and compaction.
- [ ] Wrap `postMasterAgentTaskProgress`, `completeMasterAgentTask`, and `postAppLog` so payloads are persisted before network send.
- [ ] Replay pending records on startup and every heartbeat loop.
- [ ] Preserve idempotency keys using `taskId + event kind + phase + createdAt`.
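
One way to picture the outbox operations listed above is as pure functions over the JSONL text, leaving file I/O to the caller. This is a hedged sketch, not the planned `local-agent/reliable-outbox.mjs`: the line shapes (`EventLine`, `SentLine`), the function names, the `|` key separator, and the choice to record "sent" as an appended marker line are all assumptions.

```typescript
type OutboxKind = "task.progress" | "task.complete" | "app.log";

// Hypothetical line shapes: an event, or a marker saying an event was sent.
interface EventLine { type: "event"; key: string; kind: OutboxKind; payload: unknown }
interface SentLine { type: "sent"; key: string }
type OutboxLine = EventLine | SentLine;

// Key layout from the plan: taskId + event kind + phase + createdAt.
// The "|" separator is an assumption.
function idempotencyKey(taskId: string, kind: OutboxKind, phase: string, createdAt: string): string {
  return [taskId, kind, phase, createdAt].join("|");
}

// Append one JSON document per line; existing lines are never rewritten.
function appendLine(log: string, line: OutboxLine): string {
  return log + JSON.stringify(line) + "\n";
}

// A crash mid-write can leave a torn last line, so unparsable lines are skipped.
function parseLine(raw: string): OutboxLine | undefined {
  try {
    const value: unknown = JSON.parse(raw);
    return typeof value === "object" && value !== null ? (value as OutboxLine) : undefined;
  } catch {
    return undefined;
  }
}

// Events with no matching "sent" marker yet, in insertion order.
function listPending(log: string): EventLine[] {
  const pending = new Map<string, EventLine>();
  const sent = new Set<string>();
  for (const raw of log.split("\n")) {
    const line = raw.trim() === "" ? undefined : parseLine(raw);
    if (line === undefined) continue;
    if (line.type === "sent") {
      sent.add(line.key);
      pending.delete(line.key);
    } else if (!sent.has(line.key)) {
      pending.set(line.key, line);
    }
  }
  return Array.from(pending.values());
}

// Compaction rewrites the log so it holds only the pending events.
function compact(log: string): string {
  return listPending(log).reduce((out, line) => appendLine(out, line), "");
}
```

Under these assumptions, replay on startup or on a heartbeat tick is a loop over `listPending(fileContents)` that sends each payload and appends a `sent` marker once the send succeeds.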

### Task 3: Cloud Watchdog

**Files:**
- Modify: `src/lib/boss-data.ts`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Add a lightweight watchdog function invoked during claim, progress, complete, and heartbeat-derived writes.
- [ ] Expire user conversation tasks that have stayed queued for more than 1 hour.
- [ ] Convert stale running tasks without progress into `recoverable_failed` if the turn has not started, otherwise `timed_out`.
- [ ] Ensure a late complete cannot overwrite a terminal state.
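
The watchdog rules above might reduce to a small pure function like the one below. Treat it as a sketch under stated assumptions rather than the implementation in `src/lib/boss-data.ts`: the `WatchedTask` shape, the `runWatchdog` name, the idea that every progress write extends `leaseExpiresAt`, and leaving `recoverable_failed` tasks untouched are all hypothetical. The 1-hour queued limit and the `recoverable_failed` / `timed_out` split come from this plan.

```typescript
type WatchedPhase =
  | "queued" | "claimed" | "executor_starting" | "turn_started"
  | "awaiting_reply" | "completing" | "completed"
  | "recoverable_failed" | "terminal_failed" | "timed_out" | "canceled";

// Hypothetical task shape; timestamps are epoch milliseconds.
interface WatchedTask {
  id: string;
  isUserConversation: boolean;
  phase: WatchedPhase;
  createdAt: number;
  leaseExpiresAt: number; // assumed to be extended by every progress write
}

const QUEUED_TTL_MS = 60 * 60 * 1000; // 1 hour, from the plan

// Phases the watchdog leaves alone: terminal phases, plus `recoverable_failed`,
// which is assumed to wait for an explicit retry.
const SETTLED_PHASES: readonly WatchedPhase[] = [
  "completed", "terminal_failed", "timed_out", "canceled", "recoverable_failed",
];

// Phases at or after the point where the Codex turn has started.
const TURN_STARTED_PHASES: readonly WatchedPhase[] = [
  "turn_started", "awaiting_reply", "completing",
];

function runWatchdog(task: WatchedTask, now: number): WatchedTask {
  if (SETTLED_PHASES.indexOf(task.phase) !== -1) return task;

  if (task.phase === "queued") {
    const expired = task.isUserConversation && now - task.createdAt > QUEUED_TTL_MS;
    return expired ? { ...task, phase: "timed_out" } : task;
  }

  // Running phases: nothing to do while the lease is still valid.
  if (now <= task.leaseExpiresAt) return task;

  const turnStarted = TURN_STARTED_PHASES.indexOf(task.phase) !== -1;
  return { ...task, phase: turnStarted ? "timed_out" : "recoverable_failed" };
}
```

Because the function returns the same object when nothing changes, a caller can compare references to decide whether the state file needs to be rewritten.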

### Task 4: Executor Health Grading

**Files:**
- Modify: `src/lib/boss-data.ts`
- Modify: `local-agent/codex-app-server-runner.mjs`
- Test: `src/lib/boss-data-reliability.test.ts`

- [ ] Derive `codexAppServerHealth` as `available / degraded / unavailable` from heartbeat metadata and recent errors.
- [ ] Allow GUI-preferred task claim only when health is not `unavailable`.
- [ ] Mark app-server stdio closed and timeout errors as degraded for the next heartbeat.
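
As an illustration of the grading items above, the sketch below derives a health value from a heartbeat snapshot. The `HeartbeatMeta` fields, the error code strings, the 90-second staleness window, and the three-error threshold are assumptions chosen for the example; the plan only fixes the three grades and the rule that GUI-preferred claims require a grade other than `unavailable`.

```typescript
type CodexAppServerHealth = "available" | "degraded" | "unavailable";

// Hypothetical heartbeat metadata; field names are assumptions.
interface HeartbeatMeta {
  lastHeartbeatAt: number;    // epoch ms
  loggedIn: boolean;
  appServerInitialized: boolean;
  recentErrorCodes: string[]; // e.g. "stdio_closed", "timeout" (assumed codes)
}

const HEARTBEAT_STALE_MS = 90 * 1000; // assumed staleness window
const UNAVAILABLE_AFTER_ERRORS = 3;   // assumed "repeated failure" threshold

function deriveCodexAppServerHealth(meta: HeartbeatMeta, now: number): CodexAppServerHealth {
  const offline = now - meta.lastHeartbeatAt > HEARTBEAT_STALE_MS;
  if (offline || !meta.loggedIn || !meta.appServerInitialized) return "unavailable";
  if (meta.recentErrorCodes.length >= UNAVAILABLE_AFTER_ERRORS) return "unavailable";
  // Any recent stdio-closed or timeout error downgrades the next heartbeat.
  if (meta.recentErrorCodes.length > 0) return "degraded";
  return "available";
}

// GUI-preferred tasks may be claimed unless the App Server is unavailable.
function canClaimGuiPreferred(health: CodexAppServerHealth): boolean {
  return health !== "unavailable";
}
```

Keeping the derivation pure makes it straightforward to cover in `src/lib/boss-data-reliability.test.ts` with fixed timestamps.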

### Task 5: Verification

**Files:**
- Modify: `docs/architecture/current_runtime_and_deploy_status_cn.md`

- [ ] Run `node --test local-agent/reliable-outbox.test.mjs local-agent/master-task-timeout.test.mjs`.
- [ ] Run `npx eslint src/lib/boss-data.ts local-agent/server.mjs local-agent/codex-app-server-runner.mjs local-agent/reliable-outbox.mjs`.
- [ ] Run `npm run build`.
- [ ] Run `npm run lint`.
- [ ] Document the B+ reliability shell and the local Edge direction in the runtime status doc.
@@ -0,0 +1,107 @@
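# Boss Edge Reliability Design

## Goal

Upgrade Boss's remote development control path to "cloud control plane + local Edge execution plane + reliability shell". The core goal is that enterprise customers who start a task from the APP never see it stuck for long periods, lose messages, run twice, or leak errors.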
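
## Problem

The `juyuwan` session that got stuck at the first step exposed four systemic problems:

- The local `local-agent` restarts after a Codex App Server stdio `EPIPE` interrupts it, but no local durable journal captures the task state.
- Cloud task status is only coarse-grained (`queued / running / completed / failed`), so the APP cannot tell apart "waiting for an executor", "executor started", "Codex turn started", and "writing back the result".
- A failed real-time progress write-back only produces a log warning; there is no local outbox replay.
- Executor availability is currently little more than a heartbeat description and does not yet form a health grade that is checked before task scheduling.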

## Recommended Architecture

Adopt the B+ option:

```text
Boss APP
  -> Prefer the Boss Edge on the enterprise intranet
  -> Fall back to Boss Cloud when the Edge is unreachable

Boss Edge
  -> Receive tasks for its own enterprise
  -> Maintain the local task journal / outbox / progress stream
  -> Dispatch boss-agent / Codex App Server / Codex CLI / Computer Use
  -> Sync results, audit records, and backups with the cloud

Boss Cloud
  -> Accounts, authorization, enterprise admin console, audit archive
  -> OTA, Skill distribution, cross-enterprise master control
  -> Task leases, watchdog, recovery policy
```

Phase one does not introduce a separate server process. It first gives the current `local-agent` Edge behavior: a local durable outbox, execution-phase reporting, replay, and recoverable-failure semantics. A standalone `boss-edge` service is split out later, at enterprise deployment time.
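
## Reliability Contract

### Task phases

A task needs to distinguish its status from its phase:

- `queued`: created in the cloud, waiting for a device to claim it.
- `claimed`: claimed by a device; the executor has not started yet.
- `executor_starting`: the device is preparing Codex App Server / CLI / Computer Use.
- `turn_started`: the target Codex turn or local execution action has started.
- `awaiting_reply`: the executor has taken over and is waiting for the final result.
- `completing`: the result is available locally and is being written back to the cloud.
- `completed`: the cloud has persisted the final result.
- `recoverable_failed`: the failure can be retried; the task must not stall silently.
- `terminal_failed`: the failure cannot be retried automatically and needs a user or administrator.
- `timed_out`: the task exceeded its lease or its execution timeout.
- `canceled`: canceled by the user or the system.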
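
### Outbox

`local-agent` writes every critical write-back to the local outbox first and only then sends it to the cloud:

- `task.progress`
- `task.complete`
- `app.log`

After a successful send, the record is marked sent. Records are replayed automatically after network failures, cloud 5xx responses, and process restarts. The cloud-side complete must stay idempotent, and a late complete must not overwrite a terminal state.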
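
The cloud-side half of this contract, an idempotent complete that never overwrites a terminal state, could be sketched as follows. The `StoredTask` shape, the `appliedKeys` list, and the `applyComplete` name are hypothetical; the sketch only shows one way to satisfy the two rules stated above.

```typescript
type TaskPhase =
  | "queued" | "claimed" | "executor_starting" | "turn_started"
  | "awaiting_reply" | "completing" | "completed"
  | "recoverable_failed" | "terminal_failed" | "timed_out" | "canceled";

// Hypothetical cloud-side record; `appliedKeys` remembers the idempotency
// keys of complete events that were already applied.
interface StoredTask {
  id: string;
  phase: TaskPhase;
  result?: string;
  appliedKeys: string[];
}

const TERMINAL_PHASES: readonly TaskPhase[] = [
  "completed", "terminal_failed", "timed_out", "canceled",
];

function applyComplete(task: StoredTask, key: string, result: string): StoredTask {
  // Replayed duplicate from the outbox: already applied, nothing to do.
  if (task.appliedKeys.indexOf(key) !== -1) return task;
  // Late complete: the task already reached a terminal state, keep it.
  if (TERMINAL_PHASES.indexOf(task.phase) !== -1) return task;
  return {
    ...task,
    phase: "completed",
    result,
    appliedKeys: task.appliedKeys.concat(key),
  };
}
```

With this shape, an outbox replay after a restart can resend `task.complete` as often as needed: the first delivery wins and later ones are no-ops.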
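
### Watchdog

The cloud runs a lightweight watchdog on every claim, progress, complete, and heartbeat:

- User conversation tasks that stay `queued` for more than 1 hour move to `timed_out`, so historical tasks are not executed by mistake after a fix.
- Tasks that stay `running` past their lease with no progress move to `recoverable_failed` or `timed_out`.
- A failure after `turn_started` must not be retried automatically through the CLI; the user must be offered "keep waiting / interrupt / dispatch again".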
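
### Health grading

Device capability is upgraded from a boolean to a grade:

- `available`: recent heartbeats are normal, App Server initialization succeeded, and operations on the target thread work.
- `degraded`: the device is online but App Server discovery has failures; low-risk tasks are allowed, and heavy tasks need a degradation notice.
- `unavailable`: the device is offline or not logged in, or the App Server is disconnected or failing repeatedly.

Dispatch priority: healthy Codex App Server -> CLI fallback -> user API fallback -> an explicit notice that no model channel is available.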
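
The dispatch priority above reads naturally as a fall-through selection. The sketch below is one possible reading, not the scheduler itself: `ChannelState`, `selectChannel`, and the channel identifiers are hypothetical, and it assumes that only the `available` grade counts as a "healthy" App Server (how `degraded` devices handle low-risk tasks is left out).

```typescript
type AppServerHealth = "available" | "degraded" | "unavailable";
type ModelChannel = "codex_app_server" | "codex_cli" | "user_api";

// Hypothetical snapshot of what a device can offer right now.
interface ChannelState {
  appServerHealth: AppServerHealth;
  cliAvailable: boolean;
  userApiConfigured: boolean;
}

type ChannelChoice =
  | { ok: true; channel: ModelChannel }
  | { ok: false; message: string };

// Priority: healthy Codex App Server -> CLI -> user API -> explicit notice.
function selectChannel(state: ChannelState): ChannelChoice {
  if (state.appServerHealth === "available") return { ok: true, channel: "codex_app_server" };
  if (state.cliAvailable) return { ok: true, channel: "codex_cli" };
  if (state.userApiConfigured) return { ok: true, channel: "user_api" };
  return { ok: false, message: "No model channel is available." };
}
```

Returning an explicit `ok: false` value instead of throwing keeps the "no channel" case a normal outcome that the APP can render as a message.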
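
## Security Rules

- Never write system prompts, internal prompts, API keys, local absolute paths, raw command output, or raw App Server items into user-visible errors.
- The backend stores only the error code, phase, device, task ID, and a safe summary.
- The APP shows only plain language and the next action.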
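
One conservative way to meet these rules is an allowlist: user-facing and stored text is looked up by error code and never derived from the raw error. The sketch below assumes that approach. The error codes, message strings, and the `sanitizeError` helper are hypothetical, and a real implementation might choose pattern-based redaction instead.

```typescript
// Hypothetical internal error; `rawDetail` may hold prompts, keys, or paths
// and must never leave this function.
interface InternalError {
  code: string;
  phase: string;
  deviceId: string;
  taskId: string;
  rawDetail: string;
}

interface AuditRecord { code: string; phase: string; deviceId: string; taskId: string; safeSummary: string }
interface UserFacingError { message: string; nextAction: string }
interface SafeText { safeSummary: string; message: string; nextAction: string }

// Allowlisted texts per error code (codes and wording are assumptions).
const SAFE_TEXT = new Map<string, SafeText>([
  ["app_server_stdio_closed", {
    safeSummary: "App Server connection closed",
    message: "The executor lost its connection and is recovering.",
    nextAction: "Wait for recovery or dispatch the task again.",
  }],
  ["executor_timeout", {
    safeSummary: "Executor timed out",
    message: "The task took too long and was stopped.",
    nextAction: "Dispatch the task again.",
  }],
]);

const FALLBACK_TEXT: SafeText = {
  safeSummary: "Unclassified error",
  message: "Something went wrong while running the task.",
  nextAction: "Retry, or contact your administrator.",
};

function sanitizeError(err: InternalError): { audit: AuditRecord; user: UserFacingError } {
  const known = SAFE_TEXT.has(err.code);
  const text = SAFE_TEXT.get(err.code) ?? FALLBACK_TEXT;
  return {
    // Unknown codes are replaced too, in case the code string itself leaks data.
    audit: {
      code: known ? err.code : "unknown_error",
      phase: err.phase,
      deviceId: err.deviceId,
      taskId: err.taskId,
      safeSummary: text.safeSummary,
    },
    user: { message: text.message, nextAction: text.nextAction },
  };
}
```

Since `rawDetail` is never read, nothing in it can reach the audit record or the APP, whatever it contains.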

## First Implementation Slice

This batch covers only the reliability foundation that leaves the deployment shape unchanged:

1. Add fields such as `phase / lastProgressAt / lastErrorCode / recoverable / nextRetryAt` to `MasterAgentTask`.
2. Derive progress-card step status from the task phase instead of relying only on the default index.
3. Add an outbox file and replay logic to `local-agent`, covering progress, complete, and app-log.
4. Add watchdog cleanup to the cloud claim/progress/complete paths.
5. Add Node tests covering EPIPE, outbox replay, stale running tasks, cleanup of old queued tasks, and idempotent duplicate complete.
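
## Success Criteria

- The APP no longer stays on the first step indefinitely; at worst it moves to "executor recovering / retryable / timed out".
- Unsent progress/complete records are replayed automatically after the local agent restarts.
- Historical queued tasks are not executed by mistake after a fix.
- A Codex turn that has already started is not dispatched again automatically.
- All error output is redacted and does not leak internal prompts.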