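Hermes Multi-Bot Orchestration: A Hands-On Tutorial, v1.0

Written for people who have never touched Hermes, lark-cli, or the Feishu Open Platform. Starting from a fresh macOS machine, you will bring up two or more Feishu bot Agents, each with its own job, and keep them running reliably in the background.

Every chapter follows the same structure: why do this → concrete commands → how to verify → common errors and fixes. Each chapter ends with a "Checkpoint"; if yours does not match, do not move on.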

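0. Introduction: what you are building and why Hermes

Why

Multi-bot orchestration means running several independent Agents on your machine at once, each owning one piece of the work. For example: one bot collects group messages into to-do items, one sends reminders, one drafts WeChat Moments posts, and one handles private chats with customers. Each bot has its own persona, memory store, and scope of activity, and none interferes with the others.

Why not "one all-purpose super bot"? Because the biggest problem with a single bot is not technical; it is muddled responsibility. A bot that has to answer customers, write posts, and send reminders ends up with personas and instructions that fight each other, and when it gets an ambiguous instruction it cannot tell which branch to take. Splitting the work across bots means each one guards only its own area, so debugging boundaries are clear and failures are easy to localize.

Why not "call the OpenAI / Anthropic API directly and write the code yourself"? Anyone who has tried knows that calling the model is the easy part. The hard parts are long-lived connections, sending and receiving messages, group-membership events, file upload and download, OAuth authorization, token refresh, process supervision, error retries, and multi-account isolation. Hermes wraps all of these. You configure YAML and prompts, and the framework handles the rest.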

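Hermes vs. OpenClaw vs. hand-written code: quick comparison

Dimension	Hermes Agent	OpenClaw / AstrBot-style	Hand-written Python SDK
Learning curve	Medium (CLI + YAML required)	Low (graphical UI)	High
Multi-bot orchestration	Native (profile mechanism)	Partial	Entirely your own design
LLM access	Built-in ChatGPT Plus OAuth / API key, several tiers	Mostly API key only	Build it yourself
Feishu / WeChat / Discord channels	Built in	Partly built in	Write them yourself
Long-term maintenance	Follows framework upgrades	Follows the community	Entirely on you
Best for	People who want several Agents running stably for the long haul	Hobbyists / simple single-bot setups	Engineering teams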

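Expected effort

First deployment: 2-4 hours from registering accounts to having 2 bots working. Most of the waiting is Feishu scope approval and OAuth debugging.
Ongoing maintenance: once stable, roughly a glance at the logs each week and one token expiry to handle each month. With the watchdog installed it mostly runs itself.
Adding a bot: after the first one, each additional bot takes about 20-30 minutes.

Checkpoint 0

You can name one reason not to use a single do-everything bot
You accept that the first deployment takes a few hours, and up to a day if scope approval is slow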

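1. Prerequisites

Why

Every later command assumes these are in place. Check the list now so you do not discover a missing piece halfway through the install.

Steps

Hardware:

macOS 12 or later (Apple Silicon or Intel)
At least 8 GB of RAM (4 bots running together use roughly 1-2 GB)
10 GB of free disk (Hermes itself + each bot's working directory + logs)

Accounts:

A Feishu account (personal or enterprise; it must be able to log in to the developer console at open.feishu.cn)
A ChatGPT Plus subscription (preferred; it goes through OAuth with no extra API cost) or an OpenAI API key (pay as you go; fine for light testing)
A GitHub account (to pull the Hermes source and, later, the watchdog scripts)

Tools (install in this order):

# 1. Install Homebrew (skip if already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# 2. Install Python 3.11+ and Git with Homebrew
brew install python@3.11 git

# 3. Check versions
python3 --version    # should be >= 3.11
git --version        # any version is fine

How to verify

brew --version
python3 -c "import sys; print(sys.version_info)"

You should see the Homebrew version number and a Python version of 3.11 or later.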
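
The checks above can be bundled into one preflight function that reports every missing prerequisite at once. This is a minimal sketch: `version_ge` and `preflight` are illustrative names introduced here, not part of Homebrew or Hermes, and the 3.11 threshold is simply the one this chapter states.

```shell
#!/bin/bash
# version_ge HAVE NEED: succeeds when dotted version HAVE >= NEED,
# comparing numeric fields left to right (missing fields count as 0).
version_ge() {
    local have=$1 need=$2 h n
    while [ -n "$need" ]; do
        h=${have%%.*}; n=${need%%.*}
        [ "${h:-0}" -gt "$n" ] && return 0
        [ "${h:-0}" -lt "$n" ] && return 1
        # drop the field just compared; empty the string when no dot is left
        case "$have" in *.*) have=${have#*.} ;; *) have= ;; esac
        case "$need" in *.*) need=${need#*.} ;; *) need= ;; esac
    done
    return 0
}

# preflight: print one line per problem and return non-zero if any was found.
preflight() {
    local status=0 tool pyver
    for tool in brew git python3; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; status=1; }
    done
    if command -v python3 >/dev/null 2>&1; then
        pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        version_ge "$pyver" 3.11 || { echo "python3 is $pyver, need >= 3.11"; status=1; }
    fi
    return "$status"
}
```

Source the file and run `preflight`; no output and a zero exit status mean the tool checks in Checkpoint 1 pass. The Feishu console login still has to be checked by hand.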

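Common errors and fixes

Error	Root cause	Fix
`command not found: brew`	Homebrew is not on your PATH	Run the `eval "$(/opt/homebrew/bin/brew shellenv)"` line the installer printed (on Intel Macs the path is `/usr/local/bin/brew`), then add it to `~/.zshrc`
`Python 3.9` instead of 3.11	The older system Python comes first on PATH	Call `python3.11` explicitly, or install `pyenv` to switch versions
Cannot reach the Feishu Open Platform	Wrong domain, such as lark.com or feishu.cn	The China edition is `open.feishu.cn`; the international edition is `open.larksuite.com`

Checkpoint 1

brew --version runs
python3 --version reports >= 3.11
You can log in to the open.feishu.cn console in a browser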

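2. Install Hermes Agent

Why

Hermes is an Agent runtime framework that provides profile isolation, OAuth management, launchd supervision, and IM channel integration. Every bot you create later runs on top of it. You install it once, and all bots share the same Hermes binary.

Steps

# 1. Use the official install script (simplest)
curl -fsSL https://hermes-agent.ai/install.sh | bash

# Or install from source
git clone https://github.com/NousResearch/hermes-agent.git ~/hermes-src
cd ~/hermes-src && pip install -e .

# 2. First-run setup wizard
hermes setup
# It asks for: the default HERMES_HOME directory (press Enter for ~/.hermes)
#              whether to enable the dashboard (yes recommended; default port 9119)
#              the default LLM model (press Enter for the gpt-5 family)

# 3. Connect ChatGPT Plus via OAuth (preferred)
hermes auth login openai-codex
# The terminal prints a URL and a device code
# Open the URL in a browser, paste the device code, and approve
# Back in the terminal, "Logged in" means success

# If you use an API key instead, skip the previous step and run:
# hermes auth login openai --api-key sk-xxx

How to verify

hermes --version                    # prints a version number
hermes auth status                  # shows openai-codex: logged in
hermes gateway list                 # shows the default profile (empty is normal)
hermes dashboard                    # once it is running, open http://localhost:9119 in a browser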

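Common errors and fixes

Error	Root cause	Fix
`hermes: command not found` after installing	PATH has not been reloaded	Close and reopen the terminal, or run `source ~/.zshrc`
The OAuth browser redirect will not open	The network in mainland China blocks ChatGPT domains	Turn on a proxy, or switch to the API key option
Dashboard port 9119 is taken	Port conflict	Pick another with `hermes dashboard --port 9120`
`pip install -e .` fails with `error: externally-managed-environment`	The Python installation is marked externally managed (PEP 668) and refuses global installs	Create a virtual environment with `python3 -m venv ~/.hermes-venv`, activate it with `source ~/.hermes-venv/bin/activate`, then install

Checkpoint 2

hermes --version prints output
hermes auth status shows logged in
hermes gateway list runs (even if the list is empty)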

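3. Create the first bot app on the Feishu Open Platform

Why

Each bot is one independent Feishu "custom app" (自建应用). It has its own app_id and app_secret, its own permission set, and its own boundary of group activity. Never let several bots share one app: when a shared app breaks it takes every bot down with it, and permissions become a mess.

This chapter is where people most often get stuck, because Feishu scope requests wait for internal approval. Individual developers are often approved instantly; enterprise developers may need an administrator, and waits of more than 24 hours have happened. Request every scope you need for the first bot in one go, then copy that configuration for later bots.

Steps

Step 1: create the custom app

Open https://open.feishu.cn/app in a browser
Click 「创建企业自建应用」 (Create Custom App) in the top-right corner
Fill in:

App name: whatever you want to call this bot (Chinese or English)
App description: any one-line description
App icon: optional

After creation you land on the app's detail page

Step 2: get the app_id and app_secret

On the 「凭证与基础信息」 (Credentials & Basic Info) page you will see:

App ID (format cli_axxxxxxxxxxxxxxx)
App Secret (click 「查看」 to reveal it, then copy it into a password manager right away)

⚠️ The App Secret is easiest to capture right after creation; if you lose it you have to reset it. Resetting invalidates the old secret immediately, and every bot deployed with it disconnects. Save it the first time you see it.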

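Step 3: enable the bot capability

Go to 「应用功能」 → 「机器人」, enable the bot capability, and fill in the bot's display name and description.

Step 4: request permission scopes

Go to 「权限管理」 → 「申请权限」 and select the minimal set below (it covers sending and receiving messages, group management, and reading user info):

Scope	Purpose
`im:message`	Receive messages
`im:message.group_at_msg`	Receive messages that @-mention the bot in groups
`im:message.group_at_msg:readonly`	Same as above, read-only
`im:message.p2p_msg`	Receive private-chat messages
`im:message:send_as_bot`	Send messages as the bot (the most used)
`im:chat`	Read group info
`im:chat:readonly`	Same as above, read-only
`im:resource`	Upload and download images and files
`contact:user.id:readonly`	Read user IDs
`contact:user.base:readonly`	Read basic user info

If the bot needs to send messages "as the user" (for example, forwarding into a group so the message appears to come from you), also add:

im:message:send_as_user

Then click 「申请」. The status changes to 「待管理员审核」 (pending administrator review). Individual developers are usually approved instantly; enterprise developers wait for their company's Feishu administrator.

Step 5: enable long-connection event subscription

Go to 「事件与回调」 → 「事件配置」 and choose 「使用长连接接收事件」 (receive events over a long connection). Do not choose URL mode: it requires a publicly reachable domain, whereas the long connection runs over WebSocket and needs none.

Events to subscribe to:

im.message.receive_v1 (message received)
im.chat.member.user.added_v1 (member joined a group)
im.chat.member.user.deleted_v1 (member left a group)

Select what you need, but message received is the minimum; without it the bot receives nothing.

Step 6: publish a version

Go to 「版本管理与发布」 → 「创建版本」, enter a version number (such as 1.0.0), and submit. An unpublished version is effectively offline, and the bot will not work in any group.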

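How to verify

On the 「权限管理」 page, every scope you selected shows 「已开通」 (granted), not 「待审核」 (pending)
On the 「版本管理」 page, there is a version marked 「已发布」 (published)
Pick a test group and add the bot to it (group settings → bots → add)

Common errors and fixes

Symptom	Root cause	Fix
Scope request stays at 「待审核」	The enterprise administrator has not approved it	Ask your Feishu administrator, or switch to a personal account
Bot does not react after being added to a group	No version published, scopes not approved, or the OAuth step later on has not been done	Check publish status and scope status, then finish the following chapters
Forgot to save the App Secret	It was not recorded when first shown	Click 「重置」 on the credentials page to get a new one; every deployed bot must then be reconfigured
Chose URL mode under 「事件与回调」 without having a domain	Wrong mode	Switch back to long-connection mode

Checkpoint 3

App ID and App Secret are both saved in a password manager
Every requested scope under 「权限管理」 shows 「已开通」
「版本管理」 shows one 「已发布」 version
The bot can be added to a test group (its icon appears in the group's bot list)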

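4. Register the bot app in lark-cli

Why

lark-cli is a standalone CLI for calling the Feishu OpenAPI (sending messages, reading groups, working with Bitable, reading cloud docs, and so on). Your Hermes Agent uses it internally, and you use it for manual debugging.

Its key design point: one workspace config file can hold credentials for several bots, so adding a second or third bot does not require a separate CLI install. You switch identity with --profile <name>.

Steps

# 1. Install lark-cli if you have not already (needs Node.js and npm; `brew install node` if missing)
npm install -g @larksuite/lark-cli
# or install via a brew tap

# 2. Initialize the workspace used under Hermes
export HERMES_HOME=~/.hermes
mkdir -p ~/.lark-cli/hermes

# 3. Add the first bot
# Replace <APP_ID> and <APP_SECRET> with the values you saved earlier
echo -n "<APP_SECRET>" | HERMES_HOME=~/.hermes lark-cli profile add \
  --app-id "<APP_ID>" \
  --name "<BOT_NAME>" \
  --app-secret-stdin

# <BOT_NAME> is your internal codename for this bot; every later command refers to it
# for example bot1 / agent-task / notifier

# 4. Switch to this profile (make it the active one)
HERMES_HOME=~/.hermes lark-cli profile use <BOT_NAME>

⚠️ Note: pass the app_secret on stdin (--app-secret-stdin), never as a command-line argument, because arguments are kept in shell history. The echo line above is itself recorded in history once you paste the real secret into it, so clear that entry afterward or read the secret from an interactive prompt.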
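
One way to keep the secret out of shell history entirely is to prompt for it and pipe it in from a shell builtin. The sketch below assumes bash (for `read -s`) and reuses the `lark-cli profile add` flags shown above unchanged; `emit_secret` and `add_profile_interactive` are hypothetical helper names, not lark-cli features.

```shell
#!/bin/bash
# emit_secret SECRET: write the secret to stdout with no trailing newline.
# printf is a shell builtin, so the secret does not show up as a process argument.
emit_secret() {
    printf '%s' "$1"
}

# add_profile_interactive APP_ID BOT_NAME: prompt for the App Secret without
# echoing it, then hand it to lark-cli on stdin.
add_profile_interactive() {
    local app_id=$1 bot_name=$2 secret
    read -r -s -p "App Secret for $bot_name: " secret
    echo
    emit_secret "$secret" | HERMES_HOME=~/.hermes lark-cli profile add \
        --app-id "$app_id" \
        --name "$bot_name" \
        --app-secret-stdin
}
```

Calling `add_profile_interactive "<APP_ID>" "<BOT_NAME>"` records only the app ID and bot name in history; the secret is typed at the prompt.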

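How to verify

HERMES_HOME=~/.hermes lark-cli profile list
# Should print a table with <BOT_NAME> in it and ✓ in the active column

HERMES_HOME=~/.hermes lark-cli auth status --profile <BOT_NAME>
# OAuth has not been done yet, so tokenStatus is empty or invalid
# This step only confirms the config was written; OAuth comes in the next chapter

Common errors and fixes

Error	Root cause	Fix
`unknown command: profile`	lark-cli is outdated	Upgrade with `npm update -g @larksuite/lark-cli`
`app_secret cannot be empty`	Nothing actually arrived on stdin	Make sure the secret is really piped in, and keep the `-n` on `echo -n` so no trailing newline is appended
`profile already exists`	A profile with that name was already added	Choose another name, or remove it with `lark-cli profile remove <name>` and add it again
The `--profile` flag is missing	`lark-cli config bind` was used and wrote to the old location	Never use `config bind`. It works in replace mode and overwrites the whole apps array. Always use `profile add`.

Checkpoint 4

lark-cli profile list shows the bot you just added
You have not used config bind (a notoriously destructive command)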

5. OAuth Device Flow: get the bot a user token

Why

With only app_id + app_secret the bot can obtain a tenant_access_token (an app-level token with limited permissions). For the bot to act "as the user" (read your private cloud docs, send messages under your identity, and so on), it must also go through OAuth to obtain a user_access_token.

Device Flow suits CLI use: the terminal gives you a URL and a code, you finish authorization in a browser, and the token is written back locally.

Steps

# 1. Switch to the target profile (so the next step authorizes the right one)
HERMES_HOME=~/.hermes lark-cli profile use <BOT_NAME>

# 2. Start OAuth (--recommend has the CLI request a recommended set of scopes)
HERMES_HOME=~/.hermes lark-cli auth login --recommend

# The output looks roughly like this:
# Please open the following URL in your browser:
#   https://open.feishu.cn/open-apis/authen/v1/device/code?...
# Then enter this code: ABCD-1234
#
# Waiting for authorization...

# 3. Open the URL in a browser, log in to Feishu, enter the code, and approve

# 4. The terminal then shows:
# ✓ Authorized successfully
# ✓ Token saved

How to verify

HERMES_HOME=~/.hermes lark-cli auth status --profile <BOT_NAME>

# You should see:
# Profile: <BOT_NAME>
# tokenStatus: valid
# userOpenId: ou_xxxxxxxxxxxxxxx
# tokenExpiry: 2026-06-XX ...
# refreshTokenExpiry: 2026-XX-XX ...

Write down the userOpenId. It is your identity as seen by this bot, and you will need it later to send yourself a DM.
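
Rather than copying the userOpenId by hand, it can be pulled out of the status text. This sketch assumes the `auth status` output has the `key: value` layout shown above; `extract_open_id` is an illustrative helper, not a lark-cli command.

```shell
#!/bin/bash
# extract_open_id: read `auth status` text on stdin and print the userOpenId value.
extract_open_id() {
    awk -F': *' '$1 ~ /userOpenId$/ { print $2; exit }'
}
```

For example, `HERMES_HOME=~/.hermes lark-cli auth status --profile <BOT_NAME> | extract_open_id` prints just the `ou_...` value, which you can append to your bot cheat sheet next to the profile name.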

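Common errors and fixes

Error	Root cause	Fix
`App pending approval`	The scopes from chapter 3 are not approved yet	Wait for the Feishu administrator, or re-apply with a personal account
`refresh_token reused`	The OAuth credential was consumed twice	Redo it: `lark-cli auth logout --profile X`, then log in again
`99992361 open_id cross app`	You used an open_id that belongs to another bot	Each bot sees a different open_id for the same user. Use the one shown by `auth status` for the current bot
`10003 invalid param` when sending a message	The `im:message:send_as_user` scope is missing	Add the scope in the Feishu console → wait for approval → redo auth login
Page hangs after entering the device code	The Feishu account is not a member of the developer tenant	Authorize with the same Feishu account that created the bot app

Checkpoint 5

auth status shows tokenStatus: valid
You wrote down the userOpenId (keep it in a bot cheat-sheet document so you can compare across bots)
Sending a message to a test group as this bot succeeds: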

  HERMES_HOME=~/.hermes lark-cli im +messages-send \
    --profile <BOT_NAME> --as bot \
    --chat-id <CHAT_ID> --text "hello from bot"

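6. Deploy launchd services so the gateway stays up

Why

At this point the bot can run temporarily, but it dies when you close the terminal. To keep it up permanently it has to run as a macOS launchd service (the counterpart of systemd on Linux).

Hermes ships a plist template (the launchd config file) that installs with one command. Do not hand-write your own plist: it will conflict with the bundled template, and the two will fight on later restarts.

Steps

# 1. Install the gateway service for the default profile
hermes gateway install --force

# 2. Start it right away
hermes gateway start

# 3. Install the dashboard service (web admin UI)
hermes dashboard install --force
hermes dashboard start

⚠️ A critical pitfall:

If you customize the launchd log paths (StandardOutPath / StandardErrorPath in the plist), never point them under ~/Desktop/.

Why? macOS's TCC (Transparency, Consent, and Control) privacy protection blocks a LaunchAgent from writing to protected directories such as Desktop, Documents, and Downloads. A LaunchAgent cannot show a GUI prompt, so the user never gets a chance to grant access. The write fails with errno 1 (operation not permitted), the process exits with code 78, and launchd, with KeepAlive=true, restarts it endlessly, pinning one CPU core at 100%.

Use /tmp/ or ~/Library/Logs/<your-prefix>/ for logs instead:

<key>StandardOutPath</key>
<string>/tmp/hermes-gateway.out.log</string>
<key>StandardErrorPath</key>
<string>/tmp/hermes-gateway.err.log</string>

Or simply keep the default path from the bundled Hermes template, ~/.hermes/logs/, which already handles this, and leave it unchanged.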
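
The path rule is easy to check mechanically before loading a plist. The sketch below is one possible check, not something Hermes provides: `is_tcc_protected` and `check_plist_logs` are illustrative names, and the protected list is limited to the three directories named above. `check_plist_logs` relies on `/usr/libexec/PlistBuddy`, which ships with macOS.

```shell
#!/bin/bash
# is_tcc_protected PATH: succeeds when PATH is ~/Desktop, ~/Documents or
# ~/Downloads, or anything below them. A leading "~/" is expanded first.
is_tcc_protected() {
    local path=$1 dir
    case "$path" in "~/"*) path="$HOME/${path#\~/}" ;; esac
    for dir in Desktop Documents Downloads; do
        case "$path" in
            "$HOME/$dir"|"$HOME/$dir"/*) return 0 ;;
        esac
    done
    return 1
}

# check_plist_logs PLIST: warn for each log key that points into a protected directory.
check_plist_logs() {
    local plist=$1 key value
    for key in StandardOutPath StandardErrorPath; do
        value=$(/usr/libexec/PlistBuddy -c "Print :$key" "$plist" 2>/dev/null) || continue
        if is_tcc_protected "$value"; then
            echo "[WARN] $plist: $key points at $value"
        fi
    done
}
```

Running `check_plist_logs` over each file in `~/Library/LaunchAgents/` before `launchctl load` catches the Desktop mistake without waiting for an exit code 78 loop.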

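How to verify

# 1. Confirm launchd has the services registered
launchctl list | grep ai.hermes
# Columns are PID, last exit status, label. A running service shows a numeric PID;
# "-" in the first column means the job is loaded but not currently running:
# <PID>    0    ai.hermes.gateway
# <PID>    0    ai.hermes.dashboard

# 2. Confirm the process is running
hermes gateway list
# The table shows ✓ for a running gateway

# 3. Confirm the dashboard opens
open http://localhost:9119

# 4. Confirm logs are being written
tail -20 ~/.hermes/logs/gateway.log
# You should see [Lark] connected to wss://msg-frontier.feishu.cn/...

# 5. Reboot once (the final test)
# After the reboot, without doing anything, hermes gateway list should show ✓ again

Common errors and fixes

Error	Root cause	Fix
`exit code 78` with repeated restarts	Log path is under Desktop and blocked by TCC	Change StandardOutPath in the plist to `/tmp/`, then `launchctl unload` the old plist and load the new one
`launchctl list` shows no ai.hermes entry	The install did not complete	Rerun `hermes gateway install --force` and watch for error output
`Another gateway instance is already running`	You installed the bundled Hermes plist and also a hand-written one	Remove the hand-written one: `launchctl bootout gui/$(id -u)/<old label>`, then delete `~/Library/LaunchAgents/<old>.plist`
Dashboard unreachable on 9119	Port is taken, or the dashboard is not running	`lsof -i:9119` shows what holds the port; `hermes dashboard restart` restarts it
Does not start after a reboot	RunAtLoad is not set in the plist, or the plist was installed under /Library/LaunchDaemons instead of ~/Library/LaunchAgents	The bundled Hermes template is correct by default; run `install --force` again

Checkpoint 6

launchctl list | grep ai.hermes shows both gateway and dashboard
hermes gateway list shows ✓
After a reboot the gateway comes up with no manual action
Log paths are not under ~/Desktop/ (check StandardOutPath in the plist yourself)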

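7. Add a second bot (multi-profile orchestration)

Why

You now have one bot running permanently. But a single bot invites clashing responsibilities; what you want is several bots, each owning one area. Adding a second bot gives you a second independent persona, memory, and group-activity boundary.

Core concept: the Hermes profile. Each profile corresponds to one independent bot, with its working directory, logs, and memory isolated under ~/.hermes/profiles/<BOT_NAME>/. Stateless resources such as LLM credentials and the lark-cli config are shared (less work; step 5 below covers sharing OAuth through a symlink).

Steps

Step 1: create another bot app on the Feishu Open Platform (repeat the chapter 3 flow; every bot needs its own app)

You end up with <APP_ID_2> and <APP_SECRET_2> for the second app.

Step 2: create the second profile in Hermes

# Create the child profile's working directory
hermes -p <BOT_NAME_2> setup
# It asks much the same questions as the first time. For HERMES_HOME choose ~/.hermes/profiles/<BOT_NAME_2>

# Write the Feishu app credentials into that profile's .env
cat >> ~/.hermes/profiles/<BOT_NAME_2>/.env <<EOF
FEISHU_APP_ID=<APP_ID_2>
FEISHU_APP_SECRET=<APP_SECRET_2>
EOF

Step 3: add the second bot to lark-cli as well

# Use profile add (append mode), not config bind
echo -n "<APP_SECRET_2>" | HERMES_HOME=~/.hermes lark-cli profile add \
  --app-id "<APP_ID_2>" \
  --name "<BOT_NAME_2>" \
  --app-secret-stdin

# Switch to the new profile and run OAuth
HERMES_HOME=~/.hermes lark-cli profile use <BOT_NAME_2>
HERMES_HOME=~/.hermes lark-cli auth login --recommend
# Finish the device flow in the browser

⚠️ Key constraint: config bind works in replace mode. It overwrites the entire apps array and wipes out the first bot's configuration. Always use profile add (append mode). This mistake has bitten people many times.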

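Step 4: install the launchd service for the second gateway

hermes -p <BOT_NAME_2> gateway install --force
hermes -p <BOT_NAME_2> gateway start

Use -p at the top level to select the Hermes profile. It is easy to assume --profile works the same way there, but --profile is a flag belonging to certain subcommands, not to the main entry point; misusing it silently routes to the wrong profile without any error.

Step 5: share the LLM OAuth login (if one ChatGPT Plus account serves several bots)

# Remove the second profile's own auth.json and symlink it to the main profile's
rm -f ~/.hermes/profiles/<BOT_NAME_2>/auth.json
ln -sfn ~/.hermes/auth.json ~/.hermes/profiles/<BOT_NAME_2>/auth.json

Why a symlink rather than cp? An OAuth refresh token is single-use: once one profile refreshes successfully, the old refresh token is invalid immediately. With cp you have two independent files, and the second profile, still holding the old token, fails with refresh_token_reused every few days. A symlink makes both profiles read the same physical file, so the two copies can never drift apart.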
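
Because a copied auth.json looks fine until the first refresh, it helps to check the link explicitly. The following is a small sketch under the directory layout described in this chapter (`~/.hermes/auth.json` and `~/.hermes/profiles/<name>/auth.json`); `auth_is_shared` and `check_all_profiles` are illustrative names.

```shell
#!/bin/bash
# auth_is_shared MAIN_HOME PROFILE_DIR: succeeds only when PROFILE_DIR/auth.json
# is a symlink that resolves to the same file as MAIN_HOME/auth.json.
auth_is_shared() {
    local main=$1/auth.json link=$2/auth.json
    [ -L "$link" ] || return 1
    [ "$link" -ef "$main" ]
}

# check_all_profiles [MAIN_HOME]: warn about every profile whose auth.json is
# a regular file, a dangling link, or a link to some other file.
check_all_profiles() {
    local home=${1:-$HOME/.hermes} dir
    for dir in "$home"/profiles/*/; do
        [ -d "$dir" ] || continue
        auth_is_shared "$home" "${dir%/}" ||
            echo "[WARN] ${dir%/}/auth.json is not a symlink to $home/auth.json"
    done
}
```

`check_all_profiles` prints nothing when every child profile shares the main login, which makes it easy to call from a watchdog script.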

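How to verify

# 1. Confirm lark-cli holds 2 bots
HERMES_HOME=~/.hermes lark-cli profile list
# The output should have two rows, <BOT_NAME> and <BOT_NAME_2>

# 2. Confirm identities do not drift across bots (the most important check)
# Send one message under each identity
HERMES_HOME=~/.hermes lark-cli im +messages-send \
  --profile <BOT_NAME> --as bot \
  --chat-id <CHAT_ID> --text "I am bot 1"

HERMES_HOME=~/.hermes lark-cli im +messages-send \
  --profile <BOT_NAME_2> --as bot \
  --chat-id <CHAT_ID> --text "I am bot 2"

# In the Feishu group, confirm the sender avatar and name really are two different bots

# 3. Confirm both gateways are running
hermes gateway list
# Both default and <BOT_NAME_2> should show ✓

# 4. Confirm the OAuth symlink is intact
ls -la ~/.hermes/profiles/<BOT_NAME_2>/auth.json
# The output should look like lrwxr-xr-x ... auth.json -> /Users/.../.hermes/auth.json

Common errors and fixes

Error	Root cause	Fix
Both bots show the same avatar in Feishu	Both were configured against the same app	Each needs its own app; create another one in the Feishu console
bot1's configuration vanished after `profile add`	`config bind` was used by mistake	Restore from backup (`~/.lark-cli/hermes/config.json.bak-*`), then use only `profile add`
Second bot sends and receives messages but the LLM does not answer	OAuth symlink is missing	`ln -sfn ~/.hermes/auth.json ~/.hermes/profiles/<BOT_NAME_2>/auth.json`
`refresh_token_reused`	auth.json was copied with cp instead of symlinked	Switch to a symlink (`rm -f` first, then `ln -sfn`)
Identity drift (bot1 says what bot2 should have said)	A call omitted `--profile`	Check that every lark-cli call in your scripts passes `--profile <name>`

Checkpoint 7

lark-cli profile list shows 2 bots
Messages from the two bots show different senders in the Feishu group
hermes gateway list shows 2 ✓
The second profile's auth.json is a symlink (not a regular file)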

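8. Install a monitoring watchdog

Why

A deployment without monitoring is a time bomb. These kinds of incidents kill a bot silently, and you will not know:

OAuth token expiry (7-30 days): once a refresh fails the token is invalid, and every call from the bot to the LLM returns 401.
Occasional WebSocket drops in the gateway process: the Feishu WebSocket server times out the keepalive, Hermes does not recover on its own, and the bot stops receiving messages while the process stays alive.
A stale cached proxy after a third-party proxy (Clash and the like) exits: you started the bot with the proxy on during the day and turned the proxy off at night; the bot process still has 127.0.0.1:7890 cached in memory and keeps failing against it, so the bot is completely down while the process still shows ✓.
Throttling on third-party bridge services: if you use unofficial WeChat / Discord bridges, their sessions go stale.

A watchdog is a script run on a timer that checks for these symptoms, repairs what it can automatically (restart the gateway / trigger an OAuth refresh), and raises an alert when it cannot.

Steps

Step 1: write the health-check script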

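mkdir -p ~/.hermes/watchdog
cat > ~/.hermes/watchdog/health-check.sh <<'EOF'
#!/bin/bash
# Health check for Hermes multi-bot setups

# launchd starts jobs with a minimal PATH; include the directory `which hermes` reports
# (the entries below are assumptions, adjust them to your install)
export PATH="/opt/homebrew/bin:/usr/local/bin:$HOME/.local/bin:$PATH"

BOTS=("default" "<BOT_NAME_2>")  # list every profile name you run

# Run a gateway subcommand for one bot (the default profile takes no -p)
gw() {
    if [ "$1" = "default" ]; then
        hermes gateway "$2"
    else
        hermes -p "$1" gateway "$2"
    fi
}

for bot in "${BOTS[@]}"; do
    if [ "$bot" = "default" ]; then
        label="ai.hermes.gateway"
        log_dir="$HOME/.hermes/logs"
    else
        label="ai.hermes.gateway-$bot"
        log_dir="$HOME/.hermes/profiles/$bot/logs"
    fi

    # 1. The launchd job must be loaded and have a PID (exact label match,
    #    so ai.hermes.gateway does not also match ai.hermes.gateway-<bot>)
    if ! launchctl list | awk -v l="$label" '$3 == l && $1 != "-" { ok = 1 } END { exit !ok }'; then
        echo "[ALERT] $bot gateway not running → starting it"
        gw "$bot" start
        continue
    fi

    # 2. Count known error keywords in the last 200 lines of errors.log
    if [ -f "$log_dir/errors.log" ]; then
        recent_errors=$(tail -200 "$log_dir/errors.log" | grep -cE "ProxyError|refresh_token_reused|rate limited")
        if [ "$recent_errors" -gt 5 ]; then
            echo "[ALERT] $bot has $recent_errors recent errors → auto-restarting"
            gw "$bot" restart
        fi
    fi
done
EOF
chmod +x ~/.hermes/watchdog/health-check.sh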

Step 2: install a launchd job that runs it every 10 minutes

cat > ~/Library/LaunchAgents/local.hermes.watchdog.plist <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.hermes.watchdog</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>$HOME/.hermes/watchdog/health-check.sh</string>
    </array>
    <key>StartInterval</key>
    <integer>600</integer>
    <key>RunAtLoad</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/hermes-watchdog.out.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/hermes-watchdog.err.log</string>
</dict>
</plist>
EOF

launchctl load ~/Library/LaunchAgents/local.hermes.watchdog.plist

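⚠️ Once more: never point the watchdog plist's log paths at Desktop. That triggers the TCC block and exit code 78, and on any job with KeepAlive set, an endless restart loop. Someone's watchdog was down for a week because of this before they noticed; in the meantime every OAuth token expired and the whole system stopped.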

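Step 3: install the OAuth refresh watchdog (runs daily)

cat > ~/.hermes/watchdog/oauth-refresh.sh <<'EOF'
#!/bin/bash
# Daily check of the LLM token lifetime; refresh proactively when < 24h remain

# launchd starts jobs with a minimal PATH (entries below are assumptions, adjust to your install)
export PATH="/opt/homebrew/bin:/usr/local/bin:$HOME/.local/bin:$PATH"

# Read the remaining token lifetime, in whole hours, from hermes auth status
expiry=$(hermes auth status openai-codex --json 2>/dev/null | python3 -c "
import sys, json, datetime
try:
    d = json.load(sys.stdin)
    exp = d.get('expiry_timestamp', 0)
    remaining_hours = (exp - datetime.datetime.now().timestamp()) / 3600
    print(int(remaining_hours))
except Exception:
    print(-1)
")

if [ "$expiry" -lt 24 ] && [ "$expiry" -ge 0 ]; then
    echo "Token expires in $expiry hours, refreshing..."
    hermes auth refresh openai-codex
    # Restart every bot so it reads the new token
    hermes gateway restart
    # List every non-default profile here; the quotes keep <...> from being parsed as a redirect
    for bot in "<BOT_NAME_2>"; do
        hermes -p "$bot" gateway restart
    done
fi
EOF
chmod +x ~/.hermes/watchdog/oauth-refresh.sh

Install the launchd job (runs once a day at 4 a.m.):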

cat > ~/Library/LaunchAgents/local.hermes.oauth-refresh.plist <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.hermes.oauth-refresh</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>$HOME/.hermes/watchdog/oauth-refresh.sh</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>4</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/hermes-oauth.out.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/hermes-oauth.err.log</string>
</dict>
</plist>
EOF

launchctl load ~/Library/LaunchAgents/local.hermes.oauth-refresh.plist

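How to verify

# 1. The watchdog jobs are registered
launchctl list | grep local.hermes
# Should show two entries: watchdog and oauth-refresh

# 2. Run health-check once by hand and read the output
bash ~/.hermes/watchdog/health-check.sh
# No output when healthy; ALERT lines when something is wrong

# 3. Read the watchdog's own log (confirms it runs on schedule)
tail /tmp/hermes-watchdog.out.log

# 4. Simulate a failure to test the alert path
# Stop one bot on purpose
hermes -p <BOT_NAME_2> gateway stop
# Wait up to 10 minutes, then check whether the watchdog noticed and brought it back
tail /tmp/hermes-watchdog.out.log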

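Common errors and fixes

Error	Root cause	Fix
Watchdog plist fails to load, exit 78	Log path is under Desktop	Change it to `/tmp/`, then unload and load again
Watchdog runs but misses a failure	The BOTS array is incomplete, or a label is misspelled	Compare with the actual labels from `launchctl list | grep ai.hermes` and fix the script
OAuth refresh fails and the token has really expired	Bots sharing OAuth collided	Check that auth.json in every child profile is a symlink; redo `hermes auth login openai-codex` by hand
Watchdog keeps restarting a bot that keeps dying	The bot itself fails at startup (for example a missing scope)	Find the root cause in the bot's gateway.log; the watchdog is not to blame
Alerts flood the channel	No deduplication	Have the script record the last alert time in `.alert-state.json` and suppress repeats of the same kind within 1 hour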
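
The deduplication in the last row can be sketched in a few lines. To avoid a JSON parser dependency, this version keeps one timestamp file per alert key instead of a single `.alert-state.json`; `should_alert` and `ALERT_STATE_DIR` are illustrative names, and the one-hour default matches the window suggested above.

```shell
#!/bin/bash
# Directory holding one timestamp file per alert key (assumed location).
ALERT_STATE_DIR="${ALERT_STATE_DIR:-$HOME/.hermes/watchdog/alert-state}"

# should_alert KEY [WINDOW_SECONDS]: succeeds and records the current time when
# KEY has not alerted within the window; fails while KEY is still in its quiet period.
should_alert() {
    local key=$1 window=${2:-3600}
    local stamp="$ALERT_STATE_DIR/$key" now last=0
    now=$(date +%s)
    mkdir -p "$ALERT_STATE_DIR"
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if [ $((now - last)) -lt "$window" ]; then
        return 1
    fi
    echo "$now" > "$stamp"
    return 0
}
```

In health-check.sh an alert line would then be guarded like `should_alert "$bot-errors" && echo "[ALERT] ..."`, so the same kind of alert for the same bot appears at most once per hour.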

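Checkpoint 8

launchctl list | grep local.hermes shows watchdog and oauth-refresh
Running health-check by hand gives the expected output (none when healthy)
Watchdog log paths are not under Desktop
You have run at least one drill: stop a bot on purpose and watch the watchdog rescue it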

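9. Troubleshooting quick reference

Everyday failures, organized as "symptom → root cause → one-line fix". The rule for debugging order: read errors.log first, then consider a restart, and only then look at OAuth (the reverse order wastes time).

Symptom	Root cause	One-line fix
`ProxyError 127.0.0.1:7890` keeps appearing	The Hermes process cached the old proxy after Clash/ClashX exited	`hermes -p <BOT_NAME> gateway restart`
`10003 invalid param` when calling the OpenAPI	A user OAuth scope is missing (such as `im:message:send_as_user`)	Add the scope in the Feishu console → wait for approval → `lark-cli auth login --profile <BOT_NAME>`
`99992361 open_id cross app`	An open_id from the wrong bot was used (open_id differs per bot)	`lark-cli --profile <BOT_NAME> auth status` shows the right userOpenId
`refresh_token reused`	OAuth credential shared across profiles with cp (not a symlink)	Redo the symlink: `ln -sfn ~/.hermes/auth.json ~/.hermes/profiles/<BOT_NAME_2>/auth.json`
launchd exit code 78 with repeated restarts	StandardOutPath/ErrorPath under Desktop, blocked by TCC	Move the plist paths to `/tmp/`, `launchctl unload` the old plist, then load the new one
`Another gateway instance is already running PID xxx`	A hand-written plist installed alongside the bundled Hermes plist	`launchctl bootout gui/$(id -u)/<old label>` + delete the old plist + `hermes gateway install --force`
Bot process shows ✓ but @-mentions in Feishu get no response	Cached proxy in memory / WebSocket drop / silent OAuth failure, any one of them	Start with `errors.log` and look up the error keyword in this table
Bot floods Feishu with "Gateway shutting down" notices	Hermes restarts the bot after an OAuth refresh, once every 7 days	Expected behavior, not a fault
Third-party bridge (WeChat/Discord) reports `rate limited`	Stale client session (not real throttling)	`hermes -p <BOT_NAME> gateway restart`
`--profile <name>` is reported as not found	The `HERMES_HOME=~/.hermes` prefix was left off in a multi-bot setup	Prefix every lark-cli command with `HERMES_HOME=~/.hermes`, or export it in your shell rc
With several bots running, "bot A sent what bot B should have sent"	A call relied on the active profile instead of a hard-coded `--profile`	Check that every lark-cli call passes `--profile <BOT_NAME>`
Dashboard port conflict	9119 is held by another service	`lsof -i:9119` shows the holder; move Hermes to another port
Watchdog keeps restarting a bot that keeps dying	The bot itself has a startup problem (missing scope / wrong .env)	Find the root cause in the bot's own gateway.log
`App pending approval` never clears	The enterprise administrator has not approved the scopes	Ask your Feishu administrator, or switch to a personal account
Bot does not react after being added to a group	No published version / bot capability not enabled / message-received event not subscribed on the long connection	Check each item in the Feishu console
Bot does not start after a reboot	The plist is not installed properly (a hand-written one without RunAtLoad)	Reinstall the bundled template with `hermes gateway install --force`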

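The golden order for debugging

When a bot stops responding, checking in this order saves about 90% of the time:

Read the error log (5 seconds): tail -50 ~/.hermes/profiles/<BOT_NAME>/logs/errors.log (for the default profile: ~/.hermes/logs/errors.log)
Check process status (5 seconds): hermes gateway list
Restart the gateway (30 seconds): hermes -p <BOT_NAME> gateway restart
Only if it still fails after the restart, check OAuth: lark-cli auth status --profile <BOT_NAME>
Touch the scope configuration in the Feishu console last (it needs approval)

About 90% of failures are resolved by step 3.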
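
Step 1 of this order, "look up the keyword", can be partly automated. The sketch below maps a few error keywords from the quick-reference table to their one-line fixes; `suggest_fix` is an illustrative helper and covers only the keywords listed in its body.

```shell
#!/bin/bash
# suggest_fix: read log text on stdin, print one hint per known error keyword,
# and return non-zero when no known keyword appears.
suggest_fix() {
    local log found=1
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'ProxyError'; then
        echo "ProxyError: stale proxy cached in the process -> restart the gateway"
        found=0
    fi
    if printf '%s\n' "$log" | grep -Eq 'refresh_token[ _]reused'; then
        echo "refresh_token reused: auth.json copied instead of symlinked -> redo the symlink"
        found=0
    fi
    if printf '%s\n' "$log" | grep -q 'open_id cross app'; then
        echo "open_id cross app: open_id from another bot -> use the one from this bot's auth status"
        found=0
    fi
    if printf '%s\n' "$log" | grep -q 'rate limited'; then
        echo "rate limited: stale bridge session -> restart the gateway"
        found=0
    fi
    return "$found"
}
```

A typical call is `tail -50 ~/.hermes/profiles/<BOT_NAME>/logs/errors.log | suggest_fix`; when it prints nothing, move on to step 2.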

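10. Going further

Inherent limits of the lark-cli shell-call model

By now you can run multiple bots stably. But the lark-cli + shell-call architecture has a few structural limits that cannot be worked around:

Sharing one OAuth token across bots always leaves a sync window: a refresh token is single-use, so however you share it via symlink there is a window in which every bot must restart to read the new token right after a refresh. That is why the watchdog above is designed as "refresh in the early morning, then restart every bot".
Every shell call spawns a new process: lark-cli is a Node CLI with 100-300 ms of startup time per call, so frequent calls (such as handling group messages) feel laggy.
The risk of depending on active-profile state never goes away: even if you hard-code --profile everywhere, a single missed call site lets identity drift. This is debt in lark-cli's design itself.

Mainstream reference points

If you plan to maintain a system with more than 3 bots for the long term, look at how a few large open-source projects do it:

AstrBot (github.com/AstrBotDevs/AstrBot, 33k+ ⭐): written in Python, calls the lark-oapi SDK directly, with a separate lark.Client instance per bot.
LangBot (github.com/langbot-app/LangBot, 16k+ ⭐): also calls the Python SDK directly.

Their shared pattern: bypass the CLI and keep multiple Client instances inside one process through the SDK. The upside is physical isolation (each bot has its own Client object, so identity drift cannot happen); the downside is losing the convenience of CLI debugging.

Next step: an MCP server

A more modern approach is to write a Feishu MCP server:

Write the MCP server with the lark-oapi SDK, holding N independent Clients internally
Each Hermes profile mounts its own MCP server instance
Agent skills call MCP tools instead of shelling out to the CLI

The payoff: you leave lark-cli profile switching behind entirely, and every "hard-coded --profile / hook interception" patch can be deleted. An MCP server brings its own schema, error handling, and concurrency safety.

Keeping the current architecture in the short term is fine (the watchdog above plus hard-coded --profile already runs stably). Consider migrating to MCP once you reach 5+ bots or your traffic reaches message-dense group scenarios.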

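Appendix: decision principles at a glance

1. Isolate stateful local resources; share stateless remote services

Resource	Type	What to do
LLM API token / OAuth	Remote service credential	Share via symlink (one account serves several bots)
OpenAI SDK / Python packages	Tool implementation	Share one copy
Hermes binary	Tool implementation	Share one copy
lark-cli config file	Local credential container	Single workspace + always pass --profile
launchd processes	Runtime	Fully separate (one bot, one plist, one process)
Hermes profile working directory	State / cache / memory	Fully separate

2. One naming scheme beats a mix

Do not label half your plists local.foo. and the other half ai.hermes. Use one naming rule for all launchd labels and one rule for all profile names (for example, all-lowercase English). When you rename something, grep the whole tree for references and update them together.

3. errors.log first, then restart, then OAuth

Follow this order strictly when debugging; the reverse wastes time.

4. A dead monitor is the biggest failure of all

The watchdog matters more than the bots: if it dies, you hear about no failures at all. When installing it, think one step further: "how will I know if the watchdog itself is down?" The simplest answer: have the watchdog send a heartbeat message to your Feishu private chat every day (even when nothing is wrong). No heartbeat for more than 24 hours means the watchdog is dead.

5. Path blacklist

No path a LaunchAgent touches (the script itself, log output, files it reads) may sit under Desktop / Documents / Downloads. This avoids macOS TCC blocks. Safe locations: /tmp/, ~/Library/Logs/, ~/.hermes/, ~/.local/.

6. Do not add bots with config bind

lark-cli config bind is a destructive, replace-mode command. Always use profile add. This has bitten people too often, so it gets its own entry.

7. Hard-code the identity on every call

--profile <BOT_NAME> is not optional; it is an iron rule. Skills, scripts, cron jobs, and every other kind of automation must hard-code it. Relying on the active profile is planting a mine.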
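
One way to enforce this rule in scripts is to route every call through a wrapper that refuses to run without a profile name. This is a sketch, not a lark-cli feature: `lark_as` is an illustrative name, and `LARK_CLI_BIN` is a hypothetical override (defaulting to `lark-cli`) that exists only so the wrapper can be dry-run with `echo`.

```shell
#!/bin/bash
# lark_as PROFILE ARGS...: run lark-cli with an explicit --profile appended.
# Rejects an empty first argument or one that looks like a flag, so a call
# can never fall back to whatever profile happens to be active.
lark_as() {
    local profile=$1
    if [ -z "$profile" ] || [ "${profile#-}" != "$profile" ]; then
        echo "lark_as: first argument must be a profile name" >&2
        return 2
    fi
    shift
    HERMES_HOME="${HERMES_HOME:-$HOME/.hermes}" "${LARK_CLI_BIN:-lark-cli}" "$@" --profile "$profile"
}
```

A script would then call `lark_as <BOT_NAME> im +messages-send --as bot --chat-id <CHAT_ID> --text "hello"`, and a plain `grep -n 'lark-cli' *.sh` finds any call site that still bypasses the wrapper.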

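By now you should have a stable multi-bot Agent system. About 90% of the problems you hit later are covered in the "Troubleshooting quick reference". For the remaining 10% of oddball problems, working in the order "logs first, then restart, OAuth last" will rarely leave you stuck for more than 30 minutes.

May your Agents each mind their own job and never blow up in the middle of the night.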
