Gateway Runbook (Operations Edition)

Core Responsibilities

The Gateway is OpenClaw's control plane and routing plane. It manages sessions, channels, tool policies, and API access.

Everyday Commands

bash

# Status checks
openclaw gateway status           # basic gateway status
openclaw gateway status --deep    # deep check
openclaw status                   # overall status

# Logs
openclaw logs --follow            # stream logs live
openclaw logs --filter error      # error entries only
openclaw logs --tail 100          # last 100 entries

# Diagnostics
openclaw doctor                   # system diagnostics
openclaw health                   # health check

Five-Minute Health Check

Procedure

bash

#!/bin/bash
# health-check.sh
# Requires: openclaw and jq on the PATH.

echo "=== OpenClaw Gateway Health Check ==="
echo "Time: $(date)"
echo ""

# 1. Gateway status
echo "1. Gateway Status:"
openclaw gateway status --json | jq '{
  status: .status,
  uptime: .uptime,
  version: .version
}'

# 2. Runtime status
echo ""
echo "2. Runtime Status:"
openclaw status --json | jq '{
  runtime: .runtime.status,
  channels: .channels.active,
  providers: .providers.available
}'

# 3. Channel status
echo ""
echo "3. Channels Status:"
openclaw channels status --probe --json | jq '.[] | {
  name: .name,
  status: .status,
  latency_ms: .latency_ms
}'

# 4. Error check: count error log lines from the last hour
echo ""
echo "4. Error Check (last 1h):"
ERROR_COUNT=$(openclaw logs --filter error --since 1h | wc -l)
echo "Errors in last hour: $ERROR_COUNT"

# 5. Security audit: total issues and how many are critical
echo ""
echo "5. Security Audit:"
openclaw security audit --json | jq '{
  issues: .issues | length,
  critical: .issues | map(select(.severity == "critical")) | length
}'

echo ""
echo "=== Check Complete ==="
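The script above calls both `openclaw` and `jq`, so it fails partway through if either is missing. A minimal sketch of a pre-flight check follows; `require_cmds` is a hypothetical helper name, not part of the OpenClaw CLI.

```shell
#!/bin/sh
# require_cmds: hypothetical helper, not part of the OpenClaw CLI.
# Reports every missing command on stderr and returns non-zero if any is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing dependency: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use is a line such as `require_cmds openclaw jq || exit 1` near the top of health-check.sh.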

Health Criteria

yaml

health_criteria:
  gateway:
    status: running
    uptime: '> 1h'

  channels:
    all_active: true
    latency_ms: '< 1000'

  errors:
    hourly_count: '< 10'

  security:
    critical_issues: 0
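The thresholds above can be applied mechanically once the numbers have been collected. The sketch below assumes the four values are already extracted (for example from the health check output) and are passed in as arguments; `check_health` is a hypothetical helper, and it does not cover the uptime or `all_active` criteria.

```shell
#!/bin/sh
# check_health: hypothetical helper that applies the thresholds listed above.
# Arguments: gateway status, highest channel latency in ms,
#            error count for the last hour, number of critical security issues.
# Numeric arguments are assumed to be plain integers.
check_health() {
  status=$1 latency_ms=$2 hourly_errors=$3 critical_issues=$4
  rc=0
  [ "$status" = "running" ]    || { echo "FAIL: gateway status is '$status'"; rc=1; }
  [ "$latency_ms" -lt 1000 ]   || { echo "FAIL: channel latency ${latency_ms}ms is not below 1000ms"; rc=1; }
  [ "$hourly_errors" -lt 10 ]  || { echo "FAIL: $hourly_errors errors in the last hour"; rc=1; }
  [ "$critical_issues" -eq 0 ] || { echo "FAIL: $critical_issues critical security issues"; rc=1; }
  [ "$rc" -eq 0 ] && echo "OK: all health criteria met"
  return "$rc"
}
```

For example, `check_health running 250 3 0` reports success, while `check_health running 1000 3 0` fails on latency because the criterion is strictly less than 1000 ms.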

Configuration and Reload Strategy

Key Configuration Options

yaml

# gateway.yaml
gateway:
  # Network
  host: 127.0.0.1
  port: 18789

  # Authentication
  auth:
    enabled: true
    type: token
    token: ${OPENCLAW_GATEWAY_TOKEN}

  # Reload mode
  reload:
    mode: hybrid # auto | manual | hybrid
    watch_config: true
    watch_skills: true

  # Performance
  performance:
    max_connections: 1000
    request_timeout: 30000
    idle_timeout: 60000
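Because the token is read from the `OPENCLAW_GATEWAY_TOKEN` environment variable, an unset or trivially short value is worth catching before the gateway starts. The sketch below is one way to do that; `check_gateway_token` is a hypothetical helper, and the 32-character default minimum is an assumption chosen for illustration, not an OpenClaw requirement.

```shell
#!/bin/sh
# check_gateway_token: hypothetical pre-start check, not an OpenClaw command.
# Fails when OPENCLAW_GATEWAY_TOKEN is unset, empty, or shorter than min_len.
# The default minimum of 32 characters is an illustrative assumption.
check_gateway_token() {
  min_len=${1:-32}
  token=${OPENCLAW_GATEWAY_TOKEN:-}
  if [ -z "$token" ]; then
    echo "OPENCLAW_GATEWAY_TOKEN is not set" >&2
    return 1
  fi
  if [ "${#token}" -lt "$min_len" ]; then
    echo "OPENCLAW_GATEWAY_TOKEN is shorter than $min_len characters" >&2
    return 1
  fi
  return 0
}
```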

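Reload Strategy

bash

# Auto reload (configuration changes take effect automatically)
openclaw config set gateway.reload.mode auto

# Hybrid reload (minor changes apply automatically, major changes need a manual reload)
openclaw config set gateway.reload.mode hybrid

# Trigger a manual reload
openclaw config reload

# Restart the gateway
openclaw gateway restart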

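Reload Recommendations

Environment	Reload mode	Notes
Development	auto	Fast iteration
Testing	hybrid	Balances efficiency and safety
Production	manual	Controlled changes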
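The table above can be encoded as a small lookup so that deployment scripts pick the mode consistently. This is a sketch only; `reload_mode_for_env` and the short environment names (`dev`, `test`, `prod`) are assumptions for illustration.

```shell
#!/bin/sh
# reload_mode_for_env: hypothetical lookup that mirrors the recommendation table.
# Prints the recommended reload mode, or fails for an unknown environment.
reload_mode_for_env() {
  case "$1" in
    dev|development) echo auto ;;
    test|testing)    echo hybrid ;;
    prod|production) echo manual ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

It could then feed the existing command, for example `openclaw config set gateway.reload.mode "$(reload_mode_for_env prod)"`.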

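Remote Access Recommendations

Priority Order

txt

1. Tailscale / VPN (recommended)
   └── Secure, simple, no extra configuration

2. SSH tunnel
   └── Temporary access, no extra services

3. Public proxy (authentication is mandatory)
   └── Use only when necessary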

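Tailscale Setup

bash

# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

# Join the tailnet
tailscale up

# Print this machine's Tailscale IPv4 address
tailscale ip -4

# Bind OpenClaw to the Tailscale IP (replace 100.x.y.z with the address printed above)
openclaw config set gateway.host 100.x.y.z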
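Tailscale assigns IPv4 addresses from the 100.64.0.0/10 range, so a quick sanity check before binding the gateway can catch a mistyped or non-Tailscale address. A minimal sketch follows; `is_tailscale_ipv4` is a hypothetical helper name.

```shell
#!/bin/sh
# is_tailscale_ipv4: hypothetical helper, not part of Tailscale or OpenClaw.
# Succeeds only for a dotted-quad IPv4 address inside 100.64.0.0/10.
is_tailscale_ipv4() {
  ip=$1
  # Reject empty input and anything other than digits separated by single dots.
  case "$ip" in
    ""|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  # Split the address on dots into the positional parameters.
  old_ifs=$IFS; IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  # 100.64.0.0/10 covers 100.64.0.0 through 100.127.255.255.
  [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
}
```

One possible use: `ip=$(tailscale ip -4) && is_tailscale_ipv4 "$ip" && openclaw config set gateway.host "$ip"`.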

SSH Tunnel Setup

bash

# Open an SSH tunnel (foreground)
ssh -N -L 18789:127.0.0.1:18789 user@remote-host

# Run the tunnel in the background
ssh -fN -L 18789:127.0.0.1:18789 user@remote-host

# Keep the tunnel alive with autossh
autossh -M 0 -f -N \
  -L 18789:127.0.0.1:18789 \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  user@remote-host

# Connect through the local end of the tunnel
openclaw --api-url http://localhost:18789

Public Proxy Setup (Use with Caution)

yaml

# Use only when necessary
gateway:
  host: 0.0.0.0 # listens on all interfaces, reachable from the public network

  auth:
    enabled: true
    type: token
    token: ${SECURE_TOKEN}

  tls:
    enabled: true
    cert: /path/to/cert.pem
    key: /path/to/key.pem

  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst: 10

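Common Fault Isolation Order

Troubleshooting Flowchart

txt

1. Is the process running?
   └── No → start the gateway
   └── Yes → continue

2. Do the port and bind address match expectations?
   └── No → check the configuration
   └── Yes → continue

3. Is the auth configuration in effect?
   └── No → check the token configuration
   └── Yes → continue

4. Is any channel connection in an abnormal state?
   └── No → continue
   └── Yes → check the channel configuration

5. Is the provider reachable?
   └── No → check the API key and network
   └── Yes → run deep diagnostics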
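The ordering matters: each step is only meaningful once the previous one has passed. The sketch below expresses that as a function that takes the five results in order and reports the first failure; `first_failing_step` is a hypothetical helper, and how each `ok` or `fail` value is obtained is left to the operator.

```shell
#!/bin/sh
# first_failing_step: hypothetical helper that mirrors the five-step order above.
# Arguments, in order: process, port/bind, auth, channels, provider,
# each given as "ok" or anything else (treated as a failure).
first_failing_step() {
  step=0
  for result in "$@"; do
    step=$((step + 1))
    [ "$result" = "ok" ] && continue
    case "$step" in
      1) echo "step 1: process not running, start the gateway" ;;
      2) echo "step 2: port or bind mismatch, check the configuration" ;;
      3) echo "step 3: auth not in effect, check the token configuration" ;;
      4) echo "step 4: channel connection abnormal, check the channel configuration" ;;
      5) echo "step 5: provider unreachable, check the API key and network" ;;
      *) echo "step $step: unknown step" ;;
    esac
    return 1
  done
  echo "all checks passed, continue with deep diagnostics"
  return 0
}
```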

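Troubleshooting Commands

bash

# 1. Check the process (the [o] keeps grep from matching itself)
ps aux | grep '[o]penclaw'

# 2. Check the port
lsof -i :18789
netstat -an | grep 18789

# 3. Check the configuration
openclaw config list
openclaw config validate

# 4. Check the channels
openclaw channels status --probe

# 5. Check the providers
openclaw providers test

# 6. Review the logs
openclaw logs --filter error --tail 50

# 7. Run system diagnostics
openclaw doctor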

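Common Errors and Resolutions

Error	Likely cause	Resolution
Gateway not running	Process not started	`openclaw gateway start`
Port already in use	Port is occupied	Change the port or stop the process holding it
Auth failed	Invalid token	Check `gateway.auth.token`
Channel disconnected	Channel lost its connection	Re-pair the channel
Provider unreachable	API problem	Check the API key and network
High latency	Performance problem	Check load and resource allocation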
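During an incident it can help to have the table above available as a lookup. The sketch below matches a log line against the error names in the table; `suggest_fix` is a hypothetical helper, and it assumes the real log messages contain those exact phrases.

```shell
#!/bin/sh
# suggest_fix: hypothetical lookup that mirrors the error table above.
# Assumes log lines contain the error names exactly as written in the table.
suggest_fix() {
  case "$1" in
    *"Gateway not running"*)  echo "openclaw gateway start" ;;
    *"Port already in use"*)  echo "change the port or stop the process holding it" ;;
    *"Auth failed"*)          echo "check gateway.auth.token" ;;
    *"Channel disconnected"*) echo "re-pair the channel" ;;
    *"Provider unreachable"*) echo "check the API key and network" ;;
    *"High latency"*)         echo "check load and resource allocation" ;;
    *) echo "no known fix, run: openclaw doctor"; return 1 ;;
  esac
}
```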

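On-Call Handbook Templates

Incident Report Template

markdown

# Incident Report

## Basic Information

- Severity: P0/P1/P2
- Detected at: YYYY-MM-DD HH:MM
- Recovered at: YYYY-MM-DD HH:MM
- Impact: [users affected / channels / duration]

## Symptoms

- User reports:
- Monitoring alerts:
- Error messages:

## Root Cause Analysis

- Direct cause:
- Root cause:
- Trigger conditions:

## Timeline of Actions

| Time | Action | Result |
| ---- | ------ | ------ |
|      |        |        |

## Follow-Up Actions

- [ ] Short term: [specific action]
- [ ] Medium term: [specific action]
- [ ] Long term: [specific action]

## Lessons Learned

- What went well:
- What needs improvement:
- Monitoring to add: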

Mitigation Checklist

bash

# 1. Stop the bleeding quickly
# Enable rate limiting
openclaw config set gateway.rate_limit.enabled true
openclaw config set gateway.rate_limit.requests_per_minute 30

# Disable high-risk tools
openclaw config set tools.deny '["exec:*", "browser:*"]'

# 2. Isolate the problem channel
openclaw channels disable <channel-name>

# 3. Degrade the service to a cheaper model
openclaw config set agents.defaults.model gpt-3.5-turbo

# 4. Restart the gateway
openclaw gateway restart

# 5. Roll back the configuration
openclaw config import --input config-backup.yaml
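Mitigation commands change live behavior, so previewing them before execution reduces the chance of a mistake under pressure. The sketch below is a dry-run wrapper; `run_step` and the `APPLY` variable are hypothetical names, not OpenClaw features.

```shell
#!/bin/sh
# run_step: hypothetical wrapper, not part of OpenClaw.
# Prints the command by default and executes it only when APPLY=1.
run_step() {
  if [ "${APPLY:-0}" = "1" ]; then
    echo "RUN: $*"
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}
```

For example, `run_step openclaw gateway restart` only prints the command, while the same line with `APPLY=1` exported in the environment runs it.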

Monitoring and Alert Configuration

yaml

# monitoring.yaml
alerts:
  - name: gateway_down
    condition: gateway_status != "running"
    severity: critical
    notify: [admin, ops]

  - name: high_error_rate
    condition: error_rate > 5%
    severity: warning
    notify: [admin]

  - name: channel_disconnected
    condition: channel_status != "active"
    severity: warning
    notify: [admin, ops]

  - name: high_latency
    condition: p99_latency > 10s
    severity: warning
    notify: [admin]

  - name: provider_unavailable
    condition: provider_status != "available"
    severity: critical
    notify: [admin, ops]
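The `high_error_rate` condition compares a ratio against 5%. How the monitoring system evaluates it is not specified here, so the sketch below only illustrates the arithmetic, using integer math to avoid a dependency on `bc` or `awk`. `error_rate_exceeds` is a hypothetical helper that takes the error and total counts as arguments.

```shell
#!/bin/sh
# error_rate_exceeds: hypothetical helper illustrating the high_error_rate condition.
# Arguments: error count, total request count, threshold in percent (default 5).
# Succeeds when errors / total is strictly greater than the threshold.
error_rate_exceeds() {
  errors=$1 total=$2 threshold_pct=${3:-5}
  [ "$total" -gt 0 ] || return 1   # no traffic, nothing to alert on
  # Compare errors/total > threshold/100 without division: cross-multiply.
  [ $((errors * 100)) -gt $((total * threshold_pct)) ]
}
```

With 6 errors in 100 requests the check succeeds, while exactly 5 in 100 does not, because the condition is strictly greater than 5%.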
