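Gateway Runbook (Chinese Operations Edition)
Core Responsibilities
The Gateway is OpenClaw's control plane and routing plane. It manages sessions, channels, tool policies, and API access.
References: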
- https://docs.openclaw.ai/gateway/index.md
- https://docs.openclaw.ai/gateway/configuration.md
- https://docs.openclaw.ai/gateway/troubleshooting.md
Daily Commands
bash
# Status checks
openclaw gateway status # basic status
openclaw gateway status --deep # deep check
openclaw status # overall status
# Log viewing
openclaw logs --follow # live logs
openclaw logs --filter error # error logs
openclaw logs --tail 100 # last 100 entries
# System diagnostics
openclaw doctor # system diagnostics
openclaw health # health check

Five-Minute Health Check
Check Procedure
bash
#!/bin/bash
# health-check.sh
echo "=== OpenClaw Gateway Health Check ==="
echo "Time: $(date)"
echo ""
# 1. Gateway status
echo "1. Gateway Status:"
openclaw gateway status --json | jq '{
  status: .status,
  uptime: .uptime,
  version: .version
}'
# 2. Runtime status
echo ""
echo "2. Runtime Status:"
openclaw status --json | jq '{
  runtime: .runtime.status,
  channels: .channels.active,
  providers: .providers.available
}'
# 3. Channel status
echo ""
echo "3. Channels Status:"
openclaw channels status --probe --json | jq '.[] | {
  name: .name,
  status: .status,
  latency_ms: .latency_ms
}'
# 4. Error check: count error log lines from the last hour
echo ""
echo "4. Error Check (last 1h):"
ERROR_COUNT=$(openclaw logs --filter error --since 1h | wc -l)
echo "Errors in last hour: $ERROR_COUNT"
# 5. Security check: total issues and how many of them are critical
echo ""
echo "5. Security Audit:"
openclaw security audit --json | jq '{
  issues: (.issues | length),
  critical: (.issues | map(select(.severity == "critical")) | length)
}'
echo ""
echo "=== Check Complete ==="

Health Criteria
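The thresholds listed in this section can also be checked mechanically. The following is a minimal sketch, assuming the three numbers (errors in the last hour, critical security issues, worst channel latency) have already been extracted from the health check output. The function name `meets_health_criteria` and its argument order are illustrative and not part of the OpenClaw CLI.

```shell
#!/bin/bash
# Hypothetical helper: compare measured values against the runbook thresholds.
# Arguments: <errors_last_hour> <critical_issues> <max_channel_latency_ms>
meets_health_criteria() {
  local errors="$1" critical="$2" latency_ms="$3"
  local failed=0

  if [ "$errors" -ge 10 ]; then
    echo "FAIL: $errors errors in the last hour (limit: fewer than 10)"
    failed=1
  fi
  if [ "$critical" -ne 0 ]; then
    echo "FAIL: $critical critical security issues (limit: 0)"
    failed=1
  fi
  if [ "$latency_ms" -ge 1000 ]; then
    echo "FAIL: channel latency ${latency_ms} ms (limit: under 1000 ms)"
    failed=1
  fi

  if [ "$failed" -eq 0 ]; then
    echo "OK: all criteria met"
  fi
  return "$failed"
}
```

For example, `meets_health_criteria "$ERROR_COUNT" 0 250` could be appended to the health check script once the latency and issue counts are captured in variables; the non-zero return code makes it usable from cron or CI.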
yaml
health_criteria:
  gateway:
    status: running
    uptime: '> 1h'
  channels:
    all_active: true
    latency_ms: '< 1000'
  errors:
    hourly_count: '< 10'
  security:
    critical_issues: 0

Configuration and Reload Strategy
Key Configuration Items
yaml
# gateway.yaml
gateway:
  # Network settings
  host: 127.0.0.1
  port: 18789
  # Authentication settings
  auth:
    enabled: true
    type: token
    token: ${OPENCLAW_GATEWAY_TOKEN}
  # Reload mode
  reload:
    mode: hybrid # auto | manual | hybrid
    watch_config: true
    watch_skills: true
  # Performance settings
  performance:
    max_connections: 1000
    request_timeout: 30000
    idle_timeout: 60000

Reload Strategy
bash
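# Auto reload (configuration changes take effect automatically)
openclaw config set gateway.reload.mode auto
# Hybrid reload (minor changes apply automatically, major changes manually)
openclaw config set gateway.reload.mode hybrid
# Manual reload
openclaw config reload
# Restart the gateway
openclaw gateway restart

Reload Recommendations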
| Environment | Reload mode | Notes |
|---|---|---|
| Development | auto | Fast iteration |
| Testing | hybrid | Balances efficiency and safety |
| Production | manual | Controlled changes |
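The table above can be encoded in a deployment script so the reload mode is never set by hand. This is a minimal sketch: the function name `reload_mode_for_env` and the accepted environment aliases are assumptions made for illustration, while the `openclaw config set gateway.reload.mode` command in the comment is the one shown in the reload strategy section.

```shell
#!/bin/bash
# Hypothetical helper: map an environment name to the reload mode
# recommended in the table above.
reload_mode_for_env() {
  case "$1" in
    dev|development) echo "auto" ;;
    test|testing)    echo "hybrid" ;;
    prod|production) echo "manual" ;;
    *)
      # Refuse to guess for unknown environments.
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# Example use (requires the openclaw CLI):
#   openclaw config set gateway.reload.mode "$(reload_mode_for_env prod)"
```

Failing on an unknown environment name, rather than falling back to a default, keeps a typo from silently enabling auto reload in production.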
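Remote Access Recommendations
Priority Order
txt
1. Tailscale / VPN (recommended)
   └── Secure, simple, no extra configuration needed
2. SSH tunnel
   └── Temporary access, no extra services needed
3. Public proxy (authentication is mandatory)
   └── Use only when necessary

Tailscale Configuration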
bash
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Connect to the tailnet
tailscale up
# Get the Tailscale IP
tailscale ip
# Configure OpenClaw to use the Tailscale IP
openclaw config set gateway.host 100.x.y.z

SSH Tunnel Configuration
bash
# Create an SSH tunnel
ssh -N -L 18789:127.0.0.1:18789 user@remote-host
# Run in the background
ssh -fN -L 18789:127.0.0.1:18789 user@remote-host
# Keep the tunnel alive with autossh
autossh -M 0 -f -N \
  -L 18789:127.0.0.1:18789 \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  user@remote-host
# Access through the local end of the tunnel
openclaw --api-url http://localhost:18789

Public Proxy Configuration (Use with Caution)
yaml
# Use only when necessary
gateway:
  host: 0.0.0.0 # listens on all interfaces (publicly exposed)
  auth:
    enabled: true
    type: token
    token: ${SECURE_TOKEN}
  tls:
    enabled: true
    cert: /path/to/cert.pem
    key: /path/to/key.pem
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst: 10

Common Fault Isolation Order
Troubleshooting Flowchart
txt
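1. Is the process running?
   └── No → start the gateway
   └── Yes → continue
2. Do the port and bind address match expectations?
   └── No → check the configuration
   └── Yes → continue
3. Is the auth configuration in effect?
   └── No → check the token configuration
   └── Yes → continue
4. Is the channel connection state abnormal?
   └── No → continue
   └── Yes → check the channel configuration
5. Is the provider reachable?
   └── No → check API keys and network
   └── Yes → run deep diagnostics

Troubleshooting Commands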
bash
# 1. Check the process (the [o] pattern keeps grep from matching itself)
ps aux | grep '[o]penclaw'
# 2. Check the port
lsof -i :18789
netstat -an | grep 18789
# 3. Check the configuration
openclaw config list
openclaw config validate
# 4. Check channels
openclaw channels status --probe
# 5. Check providers
openclaw providers test
# 6. View logs
openclaw logs --filter error --tail 50
# 7. System diagnostics
openclaw doctor

Common Errors and Solutions
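| Error | Likely cause | Solution |
|---|---|---|
| Gateway not running | Process not started | openclaw gateway start |
| Port already in use | Port is occupied | Change the port or stop the process holding it |
| Auth failed | Invalid token | Check gateway.auth.token |
| Channel disconnected | Channel lost its connection | Re-pair the channel |
| Provider unreachable | API problem | Check API keys and network |
| High latency | Performance problem | Check load and resource allocation |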
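The five-step isolation order can be sketched as a script that runs the checks in sequence and stops at the first failure. The `check_*` function names are hypothetical, and two assumptions are worth stating: that `openclaw channels status --probe` and `openclaw providers test` return a non-zero exit code on failure, and that a non-empty `OPENCLAW_GATEWAY_TOKEN` is an adequate stand-in for "auth is configured". Replace any probe that does not hold in your environment.

```shell
#!/bin/bash
# Sketch: walk the five checks in order and stop at the first failure.
# Each check is a shell function that returns 0 on success. The check_*
# names are hypothetical and meant to be swapped for real probes.

GATEWAY_PORT="${GATEWAY_PORT:-18789}"

check_process()  { pgrep -f openclaw >/dev/null; }
check_port()     { lsof -i ":${GATEWAY_PORT}" >/dev/null 2>&1; }
# Assumption: a non-empty token variable means auth is configured.
check_auth()     { [ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]; }
# Assumption: these commands exit non-zero when the probe fails.
check_channels() { openclaw channels status --probe >/dev/null 2>&1; }
check_provider() { openclaw providers test >/dev/null 2>&1; }

run_triage() {
  local step=1 name
  for name in process port auth channels provider; do
    if "check_${name}"; then
      echo "${step}. ${name}: ok"
    else
      echo "${step}. ${name}: FAILED, investigate here first"
      return 1
    fi
    step=$((step + 1))
  done
  echo "All checks passed, continue with deep diagnostics"
}
```

Stopping at the first failure mirrors the flowchart: a later check (for example, provider reachability) tells you little while an earlier one (the process or the port) is still broken.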
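On-Call Handbook Templates
Incident Report Template
markdown
# Incident Report
## Basic Information
- Severity: P0/P1/P2
- Detected at: YYYY-MM-DD HH:MM
- Recovered at: YYYY-MM-DD HH:MM
- Impact: [users affected / channels / duration]
## Symptoms
- User reports:
- Monitoring alerts:
- Error messages:
## Root Cause Analysis
- Direct cause:
- Root cause:
- Trigger conditions:
## Timeline of Actions
| Time | Action | Result |
| ---- | ---- | ---- |
| | | |
## Improvement Actions
- [ ] Short term: [specific action]
- [ ] Medium term: [specific action]
- [ ] Long term: [specific action]
## Lessons Learned
- What went well:
- What needs improvement:
- New monitoring to add:

Immediate Mitigation Checklist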
bash
# 1. Stop the bleeding quickly
# Enable rate limiting
openclaw config set gateway.rate_limit.enabled true
openclaw config set gateway.rate_limit.requests_per_minute 30
# Disable high-risk tools
openclaw config set tools.deny '["exec:*", "browser:*"]'
# 2. Isolate the problem channel
openclaw channels disable <channel-name>
# 3. Degrade service (fall back to a lighter model)
openclaw config set agents.defaults.model gpt-3.5-turbo
# 4. Restart the gateway
openclaw gateway restart
# 5. Roll back the configuration
openclaw config import --input config-backup.yaml

Monitoring and Alerting Configuration
yaml
# monitoring.yaml
alerts:
  - name: gateway_down
    condition: gateway_status != "running"
    severity: critical
    notify: [admin, ops]
  - name: high_error_rate
    condition: error_rate > 5%
    severity: warning
    notify: [admin]
  - name: channel_disconnected
    condition: channel_status != "active"
    severity: warning
    notify: [admin, ops]
  - name: high_latency
    condition: p99_latency > 10s
    severity: warning
    notify: [admin]
  - name: provider_unavailable
    condition: provider_status != "available"
    severity: critical
    notify: [admin, ops]
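To make the `high_error_rate` condition concrete, here is a minimal sketch of how `error_rate > 5%` could be evaluated from raw counters. The function name `error_rate_exceeded` and its arguments are hypothetical. How the counters are collected, and how the alert rules are actually evaluated by your monitoring system, are outside what this runbook specifies.

```shell
#!/bin/bash
# Sketch: evaluate the high_error_rate condition (error_rate > 5%).
# Arguments: <error_count> <total_requests> [threshold_percent, default 5]
# Returns 0 when the error rate is strictly above the threshold.
error_rate_exceeded() {
  local errors="$1" total="$2" threshold="${3:-5}"

  # No traffic means no measurable error rate; treat as not exceeded.
  [ "$total" -gt 0 ] || return 1

  # Compare errors * 100 against threshold * total to avoid the
  # rounding that a division-based percentage would introduce.
  awk -v e="$errors" -v t="$total" -v th="$threshold" \
    'BEGIN { exit !(e * 100 > th * t) }'
}
```

A caller might write `error_rate_exceeded 12 150 && echo "high_error_rate: warning"`. The comparison is strict, so exactly 5% does not fire, which matches the `>` in the rule.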