feat: add browser-assisted douyin capture flow
This commit is contained in:
2
.gitignore
vendored
2
.gitignore
vendored
@@ -20,6 +20,8 @@ build/
|
||||
.kotlin/
|
||||
**/.gradle/
|
||||
**/.kotlin/
|
||||
node_modules/
|
||||
**/node_modules/
|
||||
|
||||
# Runtime data and artifacts
|
||||
data/
|
||||
|
||||
18
README.md
18
README.md
@@ -19,6 +19,24 @@ cd /Users/kris/code/StoryForge-gitea/android-app
|
||||
./gradlew assembleDebug
|
||||
```
|
||||
|
||||
## Douyin Browser Capture
|
||||
|
||||
```bash
|
||||
cd /Users/kris/code/StoryForge-gitea/scripts/douyin-browser-capture
|
||||
npm install
|
||||
npx playwright install chromium
|
||||
npm run capture -- \
|
||||
--profile-url https://www.douyin.com/user/your_account \
|
||||
--storyforge-username kris \
|
||||
--storyforge-password 'Asd123456.'
|
||||
```
|
||||
|
||||
说明:
|
||||
|
||||
- 这是“真实浏览器 + 人工登录/过挑战 + 自动提取 + 回写 StoryForge”的辅助采集工具
|
||||
- 默认输出到 `output/playwright/douyin/`
|
||||
- 详细说明见 `scripts/douyin-browser-capture/README.md`
|
||||
|
||||
## Collector Service
|
||||
|
||||
```bash
|
||||
|
||||
@@ -116,6 +116,9 @@
|
||||
- public 页面命中抖音反爬挑战时的显式诊断返回
|
||||
- 真实 smoke 结果表明,纯 public 主页抓取会落到 `byted_acrawler` 挑战页,而不是正常 profile 数据页
|
||||
- 同时,`manual_profile_payload + manual_work_payloads` 已验证可完成账号入库、分析报告生成、相似账号搜索和对标关系写入
|
||||
- 现已新增浏览器辅助采集工具 `/Users/kris/code/StoryForge-gitea/scripts/douyin-browser-capture/capture_and_sync.mjs`
|
||||
- 该工具使用真实 Playwright Chromium 会话打开抖音页面,允许人工登录 / 过滑块后继续自动提取 `<script>` JSON、网络 JSON、视频详情页和创作者中心页数据
|
||||
- 浏览器工具最终直接调用现有 `/v2/douyin/accounts/sync`,不新增第二套持久化模型
|
||||
|
||||
结论:`douyin` 方向不再是“接口存在但不可用”,当前状态是“public 直抓受反爬限制,但人工采集兜底链已跑通”。
|
||||
|
||||
@@ -181,6 +184,6 @@
|
||||
## 当前主要风险
|
||||
|
||||
1. 小红书账号级内容源还未做真实平台验证
|
||||
2. `douyin` public 直抓仍受反爬限制,生产落地还需要补 cookie 或人工页面采集协作链
|
||||
2. `douyin` public 直抓仍受反爬限制,但现在已经有“真实浏览器 + 人工登录 + 自动提取 + 回写现有工作台”的可落地协作链
|
||||
3. `huobao-drama-upstream` 已完成代码迁移并可编译,但 fresh smoke 受外部图片/视频凭证 `403 invalid user` 阻塞
|
||||
4. Android 端目前已能完成 Debug APK 构建,但仍缺少真机安装和功能回归验证
|
||||
|
||||
@@ -158,6 +158,26 @@ docker compose up -d --build
|
||||
- 相似账号搜索:`dysearch_c247b75db0df49429a1d127407fe4486`
|
||||
- 对标关系:`dyrel_c8df266341e74237b99c880eb4b572d8`
|
||||
|
||||
浏览器辅助采集:
|
||||
|
||||
```bash
|
||||
cd /Users/kris/code/StoryForge-gitea/scripts/douyin-browser-capture
|
||||
npm install
|
||||
npx playwright install chromium
|
||||
npm run capture -- \
|
||||
--profile-url https://www.douyin.com/user/your_account \
|
||||
--storyforge-username kris \
|
||||
--storyforge-password 'Asd123456.'
|
||||
```
|
||||
|
||||
说明:
|
||||
|
||||
- 脚本会打开真实 Chromium 会话,默认复用 `~/.storyforge/douyin-playwright` 登录态
|
||||
- 如果出现扫码登录、滑块或挑战页,先在浏览器里人工完成,再回终端继续
|
||||
- 脚本会保存 `profile-bundle.json`、`storyforge-sync-request.json` 和同步响应
|
||||
- 当前已完成 headless 最小 smoke,输出目录:
|
||||
- `/tmp/storyforge-douyin-capture-smoke/2026-03-20T06-49-37.705Z-storyforge_test_001`
|
||||
|
||||
## 7. `cutvideo` 实拍剪辑链路验证
|
||||
|
||||
调用 `POST /v2/pipelines/real-cut`
|
||||
|
||||
@@ -23,6 +23,7 @@
|
||||
- live `collector` 已挂出 `/v2/douyin/*` 能力并通过认证接口验证
|
||||
- `douyin` 支持从分享文案中提取 `profile_url`,并在 public 页面命中抖音反爬挑战时返回明确诊断
|
||||
- `douyin` 手工 payload 导入与账号分析链路已跑通
|
||||
- `douyin` 浏览器辅助采集工具已接入,可用真实 Playwright Chromium 会话采集主页 / 视频页并直接调用现有 `/v2/douyin/accounts/sync`
|
||||
- 本机 `huobao-drama` API 调度、首尾帧生成、视频生成与结果回写接口
|
||||
- FastGPT 运行时依赖删除
|
||||
- 旧 FastGPT 运行残留容器已实际下线
|
||||
@@ -45,11 +46,13 @@
|
||||
- `huobao-upstream` 隔离 smoke 剧本:`drama_id=11` (`http://127.0.0.1:5681`)
|
||||
- `huobao-upstream` 隔离 smoke 启动脚本:`/Users/kris/code/huobao-drama-upstream/scripts/run_storyforge_smoke.sh`
|
||||
- Android Debug APK:`/Users/kris/code/StoryForge-gitea/android-app/app/build/outputs/apk/debug/app-debug.apk`
|
||||
- `douyin` 浏览器采集最小 smoke:`/tmp/storyforge-douyin-capture-smoke/2026-03-20T06-49-37.705Z-storyforge_test_001`
|
||||
|
||||
## 尚未完全跑通
|
||||
|
||||
- 小红书账号级内容源还未做真实平台验证
|
||||
- `douyin` public 主页直抓会命中 `public_profile_anti_bot_challenge`;当前已验证手工 payload 导入、分析、相似账号搜索和对标关系可作为可用兜底路径
|
||||
- `douyin` 浏览器辅助采集已经能真实输出 `profile-bundle.json / storyforge-sync-request.json`,但要拿到有效主页数据仍需要用户在浏览器里完成登录或挑战校验
|
||||
- `huobao-upstream` 已能全量编译;并且旧改版隔离实例也已重放确认,当前 fresh 生成被外部图片/视频凭证统一返回 `403 invalid user`
|
||||
- `huobao-upstream` 已新增 `HUOBAO_TEXT_* / HUOBAO_IMAGE_* / HUOBAO_VIDEO_*` 运行时覆盖能力,后续补新 key 可直接接管数据库配置
|
||||
- Android Debug 包已可本地构建,但尚未完成真机安装验证
|
||||
|
||||
52
scripts/douyin-browser-capture/README.md
Normal file
52
scripts/douyin-browser-capture/README.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# Douyin Browser Capture
|
||||
|
||||
This tool drives a real Playwright Chromium session, lets a human log into Douyin, captures the loaded profile and work pages, and can sync the captured bundle into StoryForge's existing `/v2/douyin/accounts/sync` endpoint.
|
||||
|
||||
## Install
|
||||
|
||||
```bash
|
||||
cd /Users/kris/code/StoryForge-gitea/scripts/douyin-browser-capture
|
||||
npm install
|
||||
npx playwright install chromium
|
||||
```
|
||||
|
||||
## Run
|
||||
|
||||
```bash
|
||||
cd /Users/kris/code/StoryForge-gitea/scripts/douyin-browser-capture
|
||||
npm run capture -- \
|
||||
--profile-url https://www.douyin.com/user/your_account \
|
||||
--storyforge-username kris \
|
||||
--storyforge-password 'Asd123456.'
|
||||
```
|
||||
|
||||
The browser uses a persistent state directory under `~/.storyforge/douyin-playwright`, so Douyin login can survive between runs.
|
||||
|
||||
## What it captures
|
||||
|
||||
- current profile page JSON blobs extracted from `<script>` tags
|
||||
- selected window globals such as `__INITIAL_STATE__`
|
||||
- relevant JSON network responses
|
||||
- creator-center pages using the same logged-in browser context
|
||||
- a limited number of video detail pages linked from the profile
|
||||
|
||||
## Output
|
||||
|
||||
Default output directory:
|
||||
|
||||
`/Users/kris/code/StoryForge-gitea/output/playwright/douyin`
|
||||
|
||||
Each run writes:
|
||||
|
||||
- `profile-bundle.json`
|
||||
- `creator-*.json`
|
||||
- `video-*.json`
|
||||
- `storyforge-sync-request.json`
|
||||
- `storyforge-sync-response.json` when sync is enabled
|
||||
- `summary.json`
|
||||
|
||||
## Notes
|
||||
|
||||
- This is designed as a browser-assisted capture flow, not a fully headless anti-bot bypass.
|
||||
- If Douyin shows a slider or challenge page, solve it manually in the opened browser window and then continue.
|
||||
- Use `--no-sync` if you only want to save a local bundle for inspection.
|
||||
683
scripts/douyin-browser-capture/capture_and_sync.mjs
Normal file
683
scripts/douyin-browser-capture/capture_and_sync.mjs
Normal file
@@ -0,0 +1,683 @@
|
||||
#!/usr/bin/env node
|
||||
|
||||
import fs from "node:fs/promises";
|
||||
import os from "node:os";
|
||||
import path from "node:path";
|
||||
import process from "node:process";
|
||||
import readline from "node:readline/promises";
|
||||
import { stdin as input, stdout as output } from "node:process";
|
||||
import { chromium } from "playwright";
|
||||
|
||||
const DEFAULT_CREATOR_CENTER_URLS = [
|
||||
"https://creator.douyin.com/creator-micro/home",
|
||||
"https://creator.douyin.com/creator-micro/data",
|
||||
"https://creator.douyin.com/creator-micro/content/manage"
|
||||
];
|
||||
const DEFAULT_OUTPUT_DIR = "/Users/kris/code/StoryForge-gitea/output/playwright/douyin";
|
||||
const DEFAULT_STATE_DIR = path.join(os.homedir(), ".storyforge", "douyin-playwright");
|
||||
const DEFAULT_BACKEND_URL = "http://127.0.0.1:8081";
|
||||
const JSON_CAPTURE_LIMIT = 1_500_000;
|
||||
const SCRIPT_SCAN_LIMIT = 2_000_000;
|
||||
const WAIT_AFTER_NAV_MS = 4_000;
|
||||
const RESPONSE_READ_TIMEOUT_MS = 2_000;
|
||||
|
||||
function printHelp() {
|
||||
console.log(`StoryForge Douyin Browser Capture
|
||||
|
||||
Usage:
|
||||
node capture_and_sync.mjs --profile-url <douyin-profile-url> [options]
|
||||
|
||||
Core options:
|
||||
--profile-url <url> Douyin profile URL to capture
|
||||
--backend-url <url> StoryForge collector base URL (default: ${DEFAULT_BACKEND_URL})
|
||||
--output-dir <dir> Capture output directory (default: ${DEFAULT_OUTPUT_DIR})
|
||||
--state-dir <dir> Persistent browser state dir (default: ${DEFAULT_STATE_DIR})
|
||||
--max-videos <n> Max video detail pages to capture (default: 4)
|
||||
--scroll-count <n> Scroll times on profile page (default: 5)
|
||||
--wait-ms <n> Wait after each navigation in ms (default: ${WAIT_AFTER_NAV_MS})
|
||||
|
||||
StoryForge auth:
|
||||
--storyforge-token <token> Existing StoryForge bearer token
|
||||
--storyforge-username <name> Login username for StoryForge
|
||||
--storyforge-password <pass> Login password for StoryForge
|
||||
|
||||
Mode flags:
|
||||
--headless Run browser headless
|
||||
--skip-login-prompt Do not pause for manual login / captcha completion
|
||||
--no-sync Capture only, do not import into StoryForge
|
||||
--no-creator-center Skip creator-center page capture
|
||||
--note <text> Discovery note saved into StoryForge
|
||||
|
||||
Examples:
|
||||
npm run capture -- \\
|
||||
--profile-url https://www.douyin.com/user/your_account \\
|
||||
--storyforge-username kris --storyforge-password 'Asd123456.'
|
||||
|
||||
npm run capture -- \\
|
||||
--profile-url https://www.douyin.com/user/your_account \\
|
||||
--storyforge-token <token> --headless --skip-login-prompt --no-creator-center
|
||||
`);
|
||||
}
|
||||
|
||||
function parseArgs(argv) {
|
||||
const options = {
|
||||
backendUrl: DEFAULT_BACKEND_URL,
|
||||
outputDir: DEFAULT_OUTPUT_DIR,
|
||||
stateDir: DEFAULT_STATE_DIR,
|
||||
maxVideos: 4,
|
||||
scrollCount: 5,
|
||||
waitMs: WAIT_AFTER_NAV_MS,
|
||||
headless: false,
|
||||
manualPrompt: true,
|
||||
syncEnabled: true,
|
||||
creatorCenterEnabled: true,
|
||||
creatorCenterUrls: [...DEFAULT_CREATOR_CENTER_URLS],
|
||||
note: "",
|
||||
profileUrl: "",
|
||||
storyforgeToken: "",
|
||||
storyforgeUsername: "",
|
||||
storyforgePassword: ""
|
||||
};
|
||||
|
||||
const requireValue = (index, flag) => {
|
||||
const value = argv[index + 1];
|
||||
if (!value || value.startsWith("--")) {
|
||||
throw new Error(`Missing value for ${flag}`);
|
||||
}
|
||||
return value;
|
||||
};
|
||||
|
||||
for (let index = 0; index < argv.length; index += 1) {
|
||||
const arg = argv[index];
|
||||
switch (arg) {
|
||||
case "--help":
|
||||
case "-h":
|
||||
options.help = true;
|
||||
break;
|
||||
case "--profile-url":
|
||||
options.profileUrl = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--backend-url":
|
||||
options.backendUrl = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--output-dir":
|
||||
options.outputDir = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--state-dir":
|
||||
options.stateDir = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--max-videos":
|
||||
options.maxVideos = Number.parseInt(requireValue(index, arg), 10);
|
||||
index += 1;
|
||||
break;
|
||||
case "--scroll-count":
|
||||
options.scrollCount = Number.parseInt(requireValue(index, arg), 10);
|
||||
index += 1;
|
||||
break;
|
||||
case "--wait-ms":
|
||||
options.waitMs = Number.parseInt(requireValue(index, arg), 10);
|
||||
index += 1;
|
||||
break;
|
||||
case "--storyforge-token":
|
||||
options.storyforgeToken = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--storyforge-username":
|
||||
options.storyforgeUsername = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--storyforge-password":
|
||||
options.storyforgePassword = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--note":
|
||||
options.note = requireValue(index, arg);
|
||||
index += 1;
|
||||
break;
|
||||
case "--headless":
|
||||
options.headless = true;
|
||||
break;
|
||||
case "--skip-login-prompt":
|
||||
options.manualPrompt = false;
|
||||
break;
|
||||
case "--no-sync":
|
||||
options.syncEnabled = false;
|
||||
break;
|
||||
case "--no-creator-center":
|
||||
options.creatorCenterEnabled = false;
|
||||
break;
|
||||
default:
|
||||
throw new Error(`Unknown argument: ${arg}`);
|
||||
}
|
||||
}
|
||||
|
||||
return options;
|
||||
}
|
||||
|
||||
function sanitizeName(value) {
|
||||
return String(value || "capture")
|
||||
.replace(/[^a-zA-Z0-9._-]+/g, "-")
|
||||
.replace(/-+/g, "-")
|
||||
.replace(/^-|-$/g, "")
|
||||
.slice(0, 80) || "capture";
|
||||
}
|
||||
|
||||
async function ensureDir(dir) {
|
||||
await fs.mkdir(dir, { recursive: true });
|
||||
}
|
||||
|
||||
function nowStamp() {
|
||||
return new Date().toISOString().replace(/[:]/g, "-");
|
||||
}
|
||||
|
||||
function sleep(ms) {
|
||||
return new Promise((resolve) => setTimeout(resolve, ms));
|
||||
}
|
||||
|
||||
async function navigateAndSettle(page, url, waitMs) {
|
||||
await page.goto(url, { waitUntil: "commit", timeout: 30_000 }).catch(() => null);
|
||||
await page.waitForLoadState("domcontentloaded", { timeout: 15_000 }).catch(() => {});
|
||||
await sleep(waitMs);
|
||||
}
|
||||
|
||||
async function maybePrompt(message, enabled) {
|
||||
if (!enabled) {
|
||||
return;
|
||||
}
|
||||
const rl = readline.createInterface({ input, output });
|
||||
try {
|
||||
await rl.question(`${message}\nPress Enter to continue... `);
|
||||
} finally {
|
||||
rl.close();
|
||||
}
|
||||
}
|
||||
|
||||
function uniqueStrings(values) {
|
||||
const seen = new Set();
|
||||
const output = [];
|
||||
for (const value of values) {
|
||||
const item = String(value || "").trim();
|
||||
if (!item || seen.has(item)) {
|
||||
continue;
|
||||
}
|
||||
seen.add(item);
|
||||
output.push(item);
|
||||
}
|
||||
return output;
|
||||
}
|
||||
|
||||
function looksLikeRelevantJsonUrl(url) {
|
||||
const lower = url.toLowerCase();
|
||||
return (
|
||||
lower.includes("douyin.com/aweme") ||
|
||||
lower.includes("douyin.com/web/api") ||
|
||||
lower.includes("douyin.com/creator") ||
|
||||
lower.includes("douyin.com/user") ||
|
||||
lower.includes("creator.douyin.com") ||
|
||||
lower.includes("iesdouyin.com")
|
||||
);
|
||||
}
|
||||
|
||||
function findJsonEnd(text, start) {
|
||||
const opening = text[start];
|
||||
const closing = opening === "{" ? "}" : "]";
|
||||
let depth = 0;
|
||||
let inString = false;
|
||||
let escaped = false;
|
||||
|
||||
for (let index = start; index < text.length; index += 1) {
|
||||
const char = text[index];
|
||||
if (inString) {
|
||||
if (escaped) {
|
||||
escaped = false;
|
||||
} else if (char === "\\") {
|
||||
escaped = true;
|
||||
} else if (char === "\"") {
|
||||
inString = false;
|
||||
}
|
||||
continue;
|
||||
}
|
||||
|
||||
if (char === "\"") {
|
||||
inString = true;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (char === opening) {
|
||||
depth += 1;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (char === closing) {
|
||||
depth -= 1;
|
||||
if (depth === 0) {
|
||||
return index + 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return -1;
|
||||
}
|
||||
|
||||
async function createResponseCapture(page) {
|
||||
const records = [];
|
||||
const seen = new Set();
|
||||
const pending = [];
|
||||
|
||||
const listener = (response) => {
|
||||
const promise = (async () => {
|
||||
try {
|
||||
const url = response.url();
|
||||
const headers = response.headers();
|
||||
const contentType = String(headers["content-type"] || "").toLowerCase();
|
||||
if (!contentType.includes("json") && !looksLikeRelevantJsonUrl(url)) {
|
||||
return;
|
||||
}
|
||||
const key = `${response.request().method()} ${url}`;
|
||||
if (seen.has(key)) {
|
||||
return;
|
||||
}
|
||||
const text = await Promise.race([
|
||||
response.text(),
|
||||
sleep(RESPONSE_READ_TIMEOUT_MS).then(() => {
|
||||
throw new Error("response read timeout");
|
||||
})
|
||||
]);
|
||||
if (!text || text.length > JSON_CAPTURE_LIMIT) {
|
||||
return;
|
||||
}
|
||||
let payload = null;
|
||||
try {
|
||||
payload = JSON.parse(text);
|
||||
} catch {
|
||||
return;
|
||||
}
|
||||
seen.add(key);
|
||||
records.push({
|
||||
url,
|
||||
method: response.request().method(),
|
||||
status: response.status(),
|
||||
payload
|
||||
});
|
||||
} catch {
|
||||
// Ignore network capture failures; page-level capture is still useful.
|
||||
}
|
||||
})();
|
||||
pending.push(promise);
|
||||
};
|
||||
|
||||
page.on("response", listener);
|
||||
return {
|
||||
records,
|
||||
async stop() {
|
||||
page.off("response", listener);
|
||||
await Promise.race([
|
||||
Promise.allSettled(pending),
|
||||
sleep(RESPONSE_READ_TIMEOUT_MS + 500)
|
||||
]);
|
||||
return records;
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
function extractJsonObjectsFromText(text) {
|
||||
const candidates = [text];
|
||||
const seen = new Set();
|
||||
const results = [];
|
||||
|
||||
for (const candidate of candidates) {
|
||||
const snippet = String(candidate || "").slice(0, SCRIPT_SCAN_LIMIT);
|
||||
for (let index = 0; index < snippet.length; index += 1) {
|
||||
const char = snippet[index];
|
||||
if (char !== "{" && char !== "[") {
|
||||
continue;
|
||||
}
|
||||
const end = findJsonEnd(snippet, index);
|
||||
if (end <= index) {
|
||||
continue;
|
||||
}
|
||||
try {
|
||||
const parsed = JSON.parse(snippet.slice(index, end));
|
||||
const marker = JSON.stringify(parsed);
|
||||
if (seen.has(marker)) {
|
||||
continue;
|
||||
}
|
||||
seen.add(marker);
|
||||
results.push(parsed);
|
||||
if (results.length >= 50) {
|
||||
return results;
|
||||
}
|
||||
} catch {
|
||||
// Keep scanning.
|
||||
}
|
||||
}
|
||||
}
|
||||
return results;
|
||||
}
|
||||
|
||||
function extractScriptPayloads(html) {
|
||||
const results = [];
|
||||
const seen = new Set();
|
||||
const regex = /<script([^>]*)>([\s\S]*?)<\/script>/gi;
|
||||
let match = null;
|
||||
while ((match = regex.exec(html)) !== null) {
|
||||
const attrs = match[1] || "";
|
||||
const content = match[2] || "";
|
||||
const idMatch = attrs.match(/id=["']([^"']+)["']/i);
|
||||
const scriptId = idMatch ? idMatch[1] : "";
|
||||
for (const payload of extractJsonObjectsFromText(content.trim())) {
|
||||
const marker = JSON.stringify(payload);
|
||||
if (seen.has(marker)) {
|
||||
continue;
|
||||
}
|
||||
seen.add(marker);
|
||||
results.push({ script_id: scriptId, payload });
|
||||
}
|
||||
}
|
||||
return results;
|
||||
}
|
||||
|
||||
async function collectWindowGlobals(page) {
|
||||
return page.evaluate(() => {
|
||||
const globalNames = [
|
||||
"__INITIAL_STATE__",
|
||||
"__NEXT_DATA__",
|
||||
"__ROUTER_DATA__",
|
||||
"SIGI_STATE",
|
||||
"__APOLLO_STATE__"
|
||||
];
|
||||
const result = {};
|
||||
for (const name of globalNames) {
|
||||
const value = globalThis[name];
|
||||
if (value === undefined) {
|
||||
continue;
|
||||
}
|
||||
try {
|
||||
result[name] = JSON.parse(JSON.stringify(value));
|
||||
} catch {
|
||||
// Skip non-serializable globals.
|
||||
}
|
||||
}
|
||||
return result;
|
||||
});
|
||||
}
|
||||
|
||||
async function collectVideoLinks(page) {
|
||||
const hrefs = await page.evaluate(() => {
|
||||
return Array.from(document.querySelectorAll("a[href]"))
|
||||
.map((node) => node.getAttribute("href") || "")
|
||||
.filter(Boolean);
|
||||
});
|
||||
return uniqueStrings(
|
||||
hrefs
|
||||
.map((href) => {
|
||||
if (href.startsWith("//")) {
|
||||
return `https:${href}`;
|
||||
}
|
||||
if (href.startsWith("/")) {
|
||||
return `https://www.douyin.com${href}`;
|
||||
}
|
||||
return href;
|
||||
})
|
||||
.filter((href) => href.includes("/video/"))
|
||||
);
|
||||
}
|
||||
|
||||
async function clickFirstVisible(page, selectors) {
|
||||
for (const selector of selectors) {
|
||||
const locator = page.locator(selector).first();
|
||||
try {
|
||||
if (await locator.isVisible({ timeout: 1000 })) {
|
||||
await locator.click({ timeout: 1000 });
|
||||
return true;
|
||||
}
|
||||
} catch {
|
||||
// Try next selector.
|
||||
}
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
async function prepareProfilePage(page, options) {
|
||||
await clickFirstVisible(page, [
|
||||
"text=作品",
|
||||
"text=视频",
|
||||
"text=全部作品",
|
||||
"[role='tab']:has-text('作品')"
|
||||
]);
|
||||
|
||||
for (let index = 0; index < 3; index += 1) {
|
||||
await clickFirstVisible(page, [
|
||||
"text=展开",
|
||||
"text=更多",
|
||||
"text=查看全部"
|
||||
]);
|
||||
}
|
||||
|
||||
for (let index = 0; index < options.scrollCount; index += 1) {
|
||||
await page.evaluate(() => window.scrollBy(0, window.innerHeight * 0.85));
|
||||
await sleep(1200);
|
||||
}
|
||||
}
|
||||
|
||||
async function capturePageBundle(page, label, responseCapture, extra = {}) {
|
||||
const html = await page.content();
|
||||
const loginGateDetected =
|
||||
html.includes("扫码登录") ||
|
||||
html.includes("验证码登录") ||
|
||||
html.includes("登录后免费畅享高清视频");
|
||||
const antiBotDetected =
|
||||
html.includes("window.byted_acrawler.init") ||
|
||||
html.includes("__ac_signature") ||
|
||||
html.includes("__ac_nonce");
|
||||
const scripts = extractScriptPayloads(html);
|
||||
const globals = await collectWindowGlobals(page);
|
||||
const network = await responseCapture.stop();
|
||||
const bundle = {
|
||||
label,
|
||||
captured_at: new Date().toISOString(),
|
||||
page_url: page.url(),
|
||||
page_title: await page.title().catch(() => ""),
|
||||
page_meta: await page.evaluate(() => ({
|
||||
href: window.location.href,
|
||||
title: document.title,
|
||||
text_excerpt: (document.body?.innerText || "").trim().slice(0, 8000)
|
||||
})),
|
||||
capture_hints: {
|
||||
login_gate_detected: loginGateDetected,
|
||||
anti_bot_detected: antiBotDetected
|
||||
},
|
||||
scripts,
|
||||
globals,
|
||||
network,
|
||||
extra
|
||||
};
|
||||
return bundle;
|
||||
}
|
||||
|
||||
async function saveJson(filePath, value) {
|
||||
await ensureDir(path.dirname(filePath));
|
||||
await fs.writeFile(filePath, JSON.stringify(value, null, 2), "utf8");
|
||||
}
|
||||
|
||||
async function loginStoryForge(baseUrl, username, password) {
|
||||
const response = await fetch(`${baseUrl.replace(/\/$/, "")}/v2/auth/login`, {
|
||||
method: "POST",
|
||||
headers: { "content-type": "application/json" },
|
||||
body: JSON.stringify({ username, password })
|
||||
});
|
||||
if (!response.ok) {
|
||||
throw new Error(`StoryForge login failed: ${response.status} ${await response.text()}`);
|
||||
}
|
||||
return response.json();
|
||||
}
|
||||
|
||||
async function syncCapture(baseUrl, token, body) {
|
||||
const response = await fetch(`${baseUrl.replace(/\/$/, "")}/v2/douyin/accounts/sync`, {
|
||||
method: "POST",
|
||||
headers: {
|
||||
"content-type": "application/json",
|
||||
Authorization: `Bearer ${token}`
|
||||
},
|
||||
body: JSON.stringify(body)
|
||||
});
|
||||
const payload = await response.json().catch(async () => ({ raw: await response.text() }));
|
||||
if (!response.ok) {
|
||||
throw new Error(`StoryForge sync failed: ${response.status} ${JSON.stringify(payload)}`);
|
||||
}
|
||||
return payload;
|
||||
}
|
||||
|
||||
async function captureCreatorPages(context, options, runDir) {
|
||||
const pages = [];
|
||||
if (!options.creatorCenterEnabled) {
|
||||
return pages;
|
||||
}
|
||||
|
||||
for (const url of options.creatorCenterUrls) {
|
||||
const page = await context.newPage();
|
||||
const responseCapture = await createResponseCapture(page);
|
||||
try {
|
||||
console.error(`Capturing creator-center page: ${url}`);
|
||||
await navigateAndSettle(page, url, options.waitMs);
|
||||
const bundle = await capturePageBundle(page, "creator_center", responseCapture);
|
||||
pages.push({
|
||||
url: bundle.page_url,
|
||||
title: bundle.page_title,
|
||||
payload: bundle
|
||||
});
|
||||
await saveJson(
|
||||
path.join(runDir, `creator-${sanitizeName(bundle.page_title || bundle.page_url)}.json`),
|
||||
bundle
|
||||
);
|
||||
} finally {
|
||||
await page.close().catch(() => {});
|
||||
}
|
||||
}
|
||||
return pages;
|
||||
}
|
||||
|
||||
async function captureVideoPages(context, videoLinks, options, runDir) {
|
||||
const pages = [];
|
||||
for (const link of videoLinks.slice(0, Math.max(options.maxVideos, 0))) {
|
||||
const page = await context.newPage();
|
||||
const responseCapture = await createResponseCapture(page);
|
||||
try {
|
||||
console.error(`Capturing video page: ${link}`);
|
||||
await navigateAndSettle(page, link, options.waitMs);
|
||||
const bundle = await capturePageBundle(page, "video_detail", responseCapture, { source_link: link });
|
||||
pages.push(bundle);
|
||||
await saveJson(path.join(runDir, `video-${sanitizeName(link)}.json`), bundle);
|
||||
} finally {
|
||||
await page.close().catch(() => {});
|
||||
}
|
||||
}
|
||||
return pages;
|
||||
}
|
||||
|
||||
async function main() {
|
||||
const options = parseArgs(process.argv.slice(2));
|
||||
if (options.help) {
|
||||
printHelp();
|
||||
return;
|
||||
}
|
||||
if (!options.profileUrl) {
|
||||
throw new Error("--profile-url is required");
|
||||
}
|
||||
if (
|
||||
options.syncEnabled &&
|
||||
!options.storyforgeToken &&
|
||||
!(options.storyforgeUsername && options.storyforgePassword)
|
||||
) {
|
||||
throw new Error("Sync mode requires --storyforge-token or both --storyforge-username and --storyforge-password");
|
||||
}
|
||||
|
||||
const runDir = path.join(
|
||||
options.outputDir,
|
||||
`${nowStamp()}-${sanitizeName(options.profileUrl.split("/").pop() || "douyin")}`
|
||||
);
|
||||
await ensureDir(runDir);
|
||||
await ensureDir(options.stateDir);
|
||||
|
||||
const context = await chromium.launchPersistentContext(options.stateDir, {
|
||||
headless: options.headless,
|
||||
viewport: { width: 1440, height: 1024 },
|
||||
args: ["--disable-blink-features=AutomationControlled"]
|
||||
});
|
||||
|
||||
try {
|
||||
const page = await context.newPage();
|
||||
const responseCapture = await createResponseCapture(page);
|
||||
console.error(`Opening profile page: ${options.profileUrl}`);
|
||||
await navigateAndSettle(page, options.profileUrl, options.waitMs);
|
||||
await maybePrompt(
|
||||
`Browser opened ${options.profileUrl}.\nLog into Douyin if needed, solve any slider/captcha, and optionally click into the creator homepage before capture.`,
|
||||
options.manualPrompt
|
||||
);
|
||||
await prepareProfilePage(page, options);
|
||||
await sleep(options.waitMs);
|
||||
const videoLinks = await collectVideoLinks(page);
|
||||
console.error(`Collected ${videoLinks.length} candidate video links`);
|
||||
const profileBundle = await capturePageBundle(page, "profile", responseCapture, { video_links: videoLinks });
|
||||
await saveJson(path.join(runDir, "profile-bundle.json"), profileBundle);
|
||||
await page.close().catch(() => {});
|
||||
|
||||
const creatorPages = await captureCreatorPages(context, options, runDir);
|
||||
const videoPages = await captureVideoPages(context, videoLinks, options, runDir);
|
||||
|
||||
const syncBody = {
|
||||
profile_url: options.profileUrl,
|
||||
manual_profile_payload: profileBundle,
|
||||
manual_creator_pages: creatorPages,
|
||||
manual_work_payloads: videoPages,
|
||||
discovery_note: options.note || "browser-assisted capture"
|
||||
};
|
||||
await saveJson(path.join(runDir, "storyforge-sync-request.json"), syncBody);
|
||||
|
||||
const summary = {
|
||||
profile_url: options.profileUrl,
|
||||
output_dir: runDir,
|
||||
video_link_count: videoLinks.length,
|
||||
captured_video_pages: videoPages.length,
|
||||
captured_creator_pages: creatorPages.length,
|
||||
sync_enabled: options.syncEnabled
|
||||
};
|
||||
|
||||
if (options.syncEnabled) {
|
||||
let token = options.storyforgeToken;
|
||||
if (!token) {
|
||||
const auth = await loginStoryForge(
|
||||
options.backendUrl,
|
||||
options.storyforgeUsername,
|
||||
options.storyforgePassword
|
||||
);
|
||||
token = auth.token;
|
||||
await saveJson(path.join(runDir, "storyforge-login.json"), {
|
||||
account: auth.account,
|
||||
default_external_base_url: auth.default_external_base_url
|
||||
});
|
||||
}
|
||||
const workspace = await syncCapture(options.backendUrl, token, syncBody);
|
||||
summary.sync_result = {
|
||||
account_id: workspace.account?.id || "",
|
||||
nickname: workspace.account?.nickname || "",
|
||||
sync_errors: workspace.sync_errors || []
|
||||
};
|
||||
await saveJson(path.join(runDir, "storyforge-sync-response.json"), workspace);
|
||||
}
|
||||
|
||||
await saveJson(path.join(runDir, "summary.json"), summary);
|
||||
console.log(JSON.stringify(summary, null, 2));
|
||||
} finally {
|
||||
await context.close().catch(() => {});
|
||||
}
|
||||
}
|
||||
|
||||
main().catch((error) => {
|
||||
console.error(error?.stack || String(error));
|
||||
process.exitCode = 1;
|
||||
});
|
||||
59
scripts/douyin-browser-capture/package-lock.json
generated
Normal file
59
scripts/douyin-browser-capture/package-lock.json
generated
Normal file
@@ -0,0 +1,59 @@
|
||||
{
|
||||
"name": "storyforge-douyin-browser-capture",
|
||||
"version": "0.1.0",
|
||||
"lockfileVersion": 3,
|
||||
"requires": true,
|
||||
"packages": {
|
||||
"": {
|
||||
"name": "storyforge-douyin-browser-capture",
|
||||
"version": "0.1.0",
|
||||
"dependencies": {
|
||||
"playwright": "^1.56.1"
|
||||
}
|
||||
},
|
||||
"node_modules/fsevents": {
|
||||
"version": "2.3.2",
|
||||
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.2.tgz",
|
||||
"integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==",
|
||||
"hasInstallScript": true,
|
||||
"license": "MIT",
|
||||
"optional": true,
|
||||
"os": [
|
||||
"darwin"
|
||||
],
|
||||
"engines": {
|
||||
"node": "^8.16.0 || ^10.6.0 || >=11.0.0"
|
||||
}
|
||||
},
|
||||
"node_modules/playwright": {
|
||||
"version": "1.58.2",
|
||||
"resolved": "https://registry.npmjs.org/playwright/-/playwright-1.58.2.tgz",
|
||||
"integrity": "sha512-vA30H8Nvkq/cPBnNw4Q8TWz1EJyqgpuinBcHET0YVJVFldr8JDNiU9LaWAE1KqSkRYazuaBhTpB5ZzShOezQ6A==",
|
||||
"license": "Apache-2.0",
|
||||
"dependencies": {
|
||||
"playwright-core": "1.58.2"
|
||||
},
|
||||
"bin": {
|
||||
"playwright": "cli.js"
|
||||
},
|
||||
"engines": {
|
||||
"node": ">=18"
|
||||
},
|
||||
"optionalDependencies": {
|
||||
"fsevents": "2.3.2"
|
||||
}
|
||||
},
|
||||
"node_modules/playwright-core": {
|
||||
"version": "1.58.2",
|
||||
"resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.58.2.tgz",
|
||||
"integrity": "sha512-yZkEtftgwS8CsfYo7nm0KE8jsvm6i/PTgVtB8DL726wNf6H2IMsDuxCpJj59KDaxCtSnrWan2AeDqM7JBaultg==",
|
||||
"license": "Apache-2.0",
|
||||
"bin": {
|
||||
"playwright-core": "cli.js"
|
||||
},
|
||||
"engines": {
|
||||
"node": ">=18"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
14
scripts/douyin-browser-capture/package.json
Normal file
14
scripts/douyin-browser-capture/package.json
Normal file
@@ -0,0 +1,14 @@
|
||||
{
|
||||
"name": "storyforge-douyin-browser-capture",
|
||||
"version": "0.1.0",
|
||||
"private": true,
|
||||
"type": "module",
|
||||
"description": "Browser-assisted Douyin capture and sync tool for StoryForge",
|
||||
"scripts": {
|
||||
"capture": "node ./capture_and_sync.mjs",
|
||||
"help": "node ./capture_and_sync.mjs --help"
|
||||
},
|
||||
"dependencies": {
|
||||
"playwright": "^1.56.1"
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user