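07 · Production MaaS Tech Stack

The full technical path from "a customer request arrives" to "the bill is issued". Sized for a 40-GPU deployment.

Architecture overview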

[Internet]
    |
    v
[Cloudflare]      CDN + DDoS + Rate Limit
    |
    v
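[Nginx]           TLS termination, rate limiting
    |
    v
[API Gateway (FastAPI)]
   - Authentication
   - Request routing (model -> backend)
   - Prepay hold
   - Prometheus instrumentation
    |
    +----> M1 (flagship model)
    +----> M2 replicas A-D (14B workhorse)
    +----> M3 models (small-model fleet)
    +----> M5 (Spot service)
    |
    v
[Settlement engine (async)]
   - Read the actual token counts from the vLLM response
   - Update PostgreSQL
   - Settle the difference
    |
    v
[Data stores]
   - PostgreSQL: users / credentials / orders / bills
   - Redis: balances / rate limits / counters
   - Loki: request logs
    |
    v
[Customer portal (Next.js)]
   - Sign-up / login
   - Credential management
   - Balance / billing
   - Usage statistics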

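1. Component selection

Component	Choice	Alternatives	Notes
CDN / DDoS	Cloudflare	AWS CloudFront	Free plan is sufficient
Reverse proxy	Nginx	Caddy / Traefik	Mature and stable
Gateway	FastAPI (Python)	Kong / APISIX	Fast to build in-house
Database	PostgreSQL 16	MySQL	JSONB is convenient for metadata
KV cache	Redis 7	KeyDB	Rate limiting + balances
Logging	Loki + Promtail	ELK	Lightweight
Monitoring	Prometheus + Grafana	Datadog	Self-hosted
Payments	Stripe / Ping++		Overseas / mainland China
Email	Resend	SendGrid	Notifications / invoices
Domain	Cloudflare	Alibaba Cloud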

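2. Gateway core logic

Path: POST /v1/chat/completions (OpenAI-compatible protocol)

Steps:

Authentication: extract the credential string from the Authorization header; look up user_id in the Redis cache first and fall back to PG on a miss
Rate limiting: Redis + Lua token bucket keyed by user_id (10 requests per minute, 500 per hour)
Balance: the live balance is kept in Redis; reject immediately with HTTP 402 if it is insufficient
Routing: the "model" field in the body decides which machine the request goes to
Priority: the X-Priority header takes the value standard or spot
Prepay: hold max_tokens × unit price up front
Proxy: forward to the backend vLLM instance
Response: pass the SSE stream through unchanged
Settlement: after the request completes, read the actual counts from response.usage and write them to PG asynchronously
Difference: prepay - actual cost = amount refunded to the customer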
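
The rate-limiting step can be illustrated with a minimal in-process sketch of the token bucket logic. The document runs this as a Lua script inside Redis so that refill and consume happen atomically. The version below only shows the arithmetic, and the class names, the explicit `now` argument and the "both buckets must have a token" rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_per_sec`."""
    capacity: float
    refill_per_sec: float
    tokens: float
    updated_at: float

    @classmethod
    def full(cls, capacity: float, window_sec: float, now: float) -> "TokenBucket":
        # A bucket that allows `capacity` requests per `window_sec` on average.
        return cls(capacity, capacity / window_sec, capacity, now)

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now


class UserRateLimiter:
    """Per-user limiter: a request needs a token from the minute AND the hour bucket."""

    def __init__(self, per_minute: int = 10, per_hour: int = 500):
        self.per_minute = per_minute
        self.per_hour = per_hour
        self._buckets = {}  # user_id -> (minute bucket, hour bucket)

    def allow(self, user_id: str, now: float) -> bool:
        if user_id not in self._buckets:
            self._buckets[user_id] = (
                TokenBucket.full(self.per_minute, 60.0, now),
                TokenBucket.full(self.per_hour, 3600.0, now),
            )
        minute, hour = self._buckets[user_id]
        minute.refill(now)
        hour.refill(now)
        # Consume only when both buckets can pay, so a rejection costs nothing.
        if minute.tokens >= 1.0 and hour.tokens >= 1.0:
            minute.tokens -= 1.0
            hour.tokens -= 1.0
            return True
        return False
```

With the defaults, a burst of 10 requests from one user passes and the 11th is rejected until the minute bucket refills. A rejected request maps to the HTTP 429 response in the gateway.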

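Key code structure (pseudocode):

POST /v1/chat/completions:
    # Authentication: Redis cache first, PG on a miss
    creds = extract_credentials(request)
    user = redis.get("user:" + creds) or db.query_user(creds)
    if not user: return 401

    # Rate limiting
    if not rate_limit_ok(user.id): return 429

    model = request.body["model"]
    priority = request.headers.get("X-Priority", "standard")

    # Prepay: unit_price is quoted per 1,000,000 tokens
    max_tokens = request.body.get("max_tokens", 4096)
    unit_price = PRICING[model][priority]
    prepay = max_tokens * unit_price / 1000000

    # Balance check and hold. These two steps should run as one atomic
    # operation (e.g. a single Lua script), otherwise concurrent requests
    # can both pass the check. Redis INCRBY / DECRBY only accept integers,
    # so amounts should be stored in an integer unit such as cents.
    if redis.get("balance:" + user.id) < prepay:
        return 402

    redis.decr("balance:" + user.id, prepay)

    # Proxy to vLLM and pass the SSE stream through
    backend = pick_backend(model, priority)
    response = proxy(backend, request)

    for chunk in response:
        yield chunk

    # Settlement: OpenAI-compatible responses report usage as
    # prompt_tokens / completion_tokens
    real_in = response.usage.prompt_tokens
    real_out = response.usage.completion_tokens
    real_cost = compute_cost(model, priority, real_in, real_out)
    # Can be negative when input tokens cost more than the unused output hold
    refund = prepay - real_cost
    redis.incr("balance:" + user.id, refund)

    async_write_billing(user.id, model, real_in, real_out, real_cost)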
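
The prepay and refund arithmetic above can be worked through as a small sketch. Everything concrete here is an assumption for illustration: the prices, the split into separate input and output prices, and the choice of integer micro-cents as the unit (tokens × cents per 1,000,000 tokens gives micro-cents exactly, so no floats are needed).

```python
from dataclasses import dataclass

# Hypothetical price table: cents per 1,000,000 tokens.
PRICING = {
    ("qwen2.5-14b", "standard"): {"input": 20, "output": 60},
    ("qwen2.5-14b", "spot"): {"input": 10, "output": 30},
}


@dataclass(frozen=True)
class Settlement:
    prepay: int     # micro-cents held before the request ran
    real_cost: int  # micro-cents actually owed
    refund: int     # micro-cents returned; negative means an extra charge


def prepay_amount(model: str, priority: str, max_tokens: int) -> int:
    """Hold the worst case for output: max_tokens at the output price."""
    return max_tokens * PRICING[(model, priority)]["output"]


def compute_cost(model: str, priority: str, input_tokens: int, output_tokens: int) -> int:
    price = PRICING[(model, priority)]
    return input_tokens * price["input"] + output_tokens * price["output"]


def settle(model: str, priority: str, max_tokens: int,
           input_tokens: int, output_tokens: int) -> Settlement:
    prepay = prepay_amount(model, priority, max_tokens)
    real_cost = compute_cost(model, priority, input_tokens, output_tokens)
    return Settlement(prepay, real_cost, prepay - real_cost)


# A short answer returns most of the hold.
typical = settle("qwen2.5-14b", "standard", 4096, 1000, 500)

# A long prompt with a small max_tokens costs more than the hold covered.
long_prompt = settle("qwen2.5-14b", "standard", 100, 50000, 100)
```

The second case shows why the settlement step has to tolerate a negative refund: a hold based only on max_tokens does not cover the input side of the bill.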

3. Model routing table

Example routes.yaml config file:

models:
  qwen3-72b:
    priority_standard:
      - M1-IP:8001
    priority_spot: []

  qwen2.5-14b:
    priority_standard:
      - M2-IP:8001
      - M2-IP:8002
      - M2-IP:8003
      - M2-IP:8004
    priority_spot:
      - M5-IP:8030

  qwen2.5-7b:
    priority_standard:
      - M3-IP:8013

  qwen-coder-7b:
    priority_standard:
      - M3-IP:8014

  deepseek-coder-lite:
    priority_standard:
      - M3-IP:8015

  qwen2-vl-7b:
    priority_standard:
      - M3-IP:8016

  bge-m3:
    priority_standard:
      - M3-IP:8010

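Routing strategy:

Single replica → connect directly
Multiple replicas → least_conn or hash by user_id (sticky routing improves the prefix cache hit rate)
Health check: ping each backend's /health every 10 seconds and remove unhealthy nodes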
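
One way to sketch the sticky option is rendezvous (highest-random-weight) hashing. This is a specific technique swapped in here for "hash by user_id", chosen because removing one unhealthy replica only moves the users that were pinned to it. The `Router` class, `set_health` and `_score` are hypothetical names, and the routes dict is a hand-parsed subset of the YAML above.

```python
import hashlib
from typing import Optional

# Subset of routes.yaml, already parsed into a dict.
ROUTES = {
    "qwen3-72b": {"priority_standard": ["M1-IP:8001"], "priority_spot": []},
    "qwen2.5-14b": {
        "priority_standard": ["M2-IP:8001", "M2-IP:8002", "M2-IP:8003", "M2-IP:8004"],
        "priority_spot": ["M5-IP:8030"],
    },
}


def _score(user_id: str, backend: str) -> int:
    # hashlib gives a hash that is stable across processes, unlike built-in hash().
    digest = hashlib.sha256(f"{user_id}|{backend}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Router:
    def __init__(self, routes: dict):
        self.routes = routes
        self.unhealthy = set()

    def set_health(self, backend: str, healthy: bool) -> None:
        """Called by the periodic /health checker."""
        if healthy:
            self.unhealthy.discard(backend)
        else:
            self.unhealthy.add(backend)

    def pick_backend(self, model: str, priority: str, user_id: str) -> Optional[str]:
        configured = self.routes.get(model, {}).get(f"priority_{priority}", [])
        candidates = [b for b in configured if b not in self.unhealthy]
        if not candidates:
            return None  # no healthy backend for this model and priority
        if len(candidates) == 1:
            return candidates[0]  # single replica: connect directly
        # Multiple replicas: the same user always wins on the same backend.
        return max(candidates, key=lambda b: _score(user_id, b))
```

A `None` result would typically become a 503 in the gateway. If least_conn is preferred instead, the `max(...)` line is the only part that changes.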

4. Data model (PostgreSQL)

Core tables (column lists):

users: id / email / name / created_at / tier
credentials: id / user_id / hash (store the hash, never the plaintext) / name / created_at / last_used_at / revoked
balances: user_id / balance_cents / free_quota_input / free_quota_output / updated_at
requests: id / user_id / creds_id / model / priority / input_tokens / output_tokens / cost_cents / duration_ms / ttft_ms / status / error_msg / created_at
billing_events: id / user_id / event_type / amount_cents / balance_after / metadata / created_at
topups: id / user_id / amount_cents / method / status / payment_id / created_at / paid_at

Indexes:

requests(user_id, created_at)
requests(model, created_at)
billing_events(user_id, created_at)
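
The "store the hash, never the plaintext" rule for the credentials table could look like the sketch below. The `sk-` prefix, the function names and the choice of SHA-256 are assumptions. A plain fast hash is a reasonable fit only because the credential is a long random string rather than a user-chosen password.

```python
import hashlib
import hmac
import secrets


def hash_credential(plaintext: str) -> str:
    """Value to store in credentials.hash."""
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def new_credential() -> tuple:
    """Return (plaintext shown to the user once, hash to persist)."""
    plaintext = "sk-" + secrets.token_urlsafe(32)
    return plaintext, hash_credential(plaintext)


def verify_credential(plaintext: str, stored_hash: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(hash_credential(plaintext), stored_hash)
```

Because the hash is deterministic, the gateway can hash the incoming credential and use that value as both the Redis cache key and the PG lookup key, so the plaintext never needs to be stored anywhere.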

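5. Settlement worker (async)

Consumes billing events from a message queue (or the gateway writes to PG directly), then does dual bookkeeping and reconciliation.

Flow:

Consume one event
Update the PG requests table
Insert a ledger entry into PG billing_events
Update PG balances (actual cost vs. the prepay difference)
Update the Redis balance cache
Idempotency: deduplicate by request_id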
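
The idempotency step can be sketched by letting the database enforce uniqueness on the request id inside one transaction. This sketch uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so it runs anywhere. The trimmed-down schema, the event dict shape and `apply_event` are assumptions, and the Redis cache update is left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE balances (user_id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL);
CREATE TABLE requests (
    id TEXT PRIMARY KEY, user_id TEXT NOT NULL, model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL, output_tokens INTEGER NOT NULL,
    cost_cents INTEGER NOT NULL);
CREATE TABLE billing_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT, user_id TEXT NOT NULL,
    event_type TEXT NOT NULL, amount_cents INTEGER NOT NULL,
    balance_after INTEGER NOT NULL);
"""


def apply_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply one billing event. Returns False if request_id was already settled."""
    try:
        with conn:  # one transaction: commit on success, roll back on any error
            # The primary key on requests.id is the dedup guard.
            conn.execute(
                "INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)",
                (event["request_id"], event["user_id"], event["model"],
                 event["input_tokens"], event["output_tokens"], event["cost_cents"]),
            )
            conn.execute(
                "UPDATE balances SET balance_cents = balance_cents - ? WHERE user_id = ?",
                (event["cost_cents"], event["user_id"]),
            )
            (balance_after,) = conn.execute(
                "SELECT balance_cents FROM balances WHERE user_id = ?",
                (event["user_id"],),
            ).fetchone()
            conn.execute(
                "INSERT INTO billing_events (user_id, event_type, amount_cents, balance_after)"
                " VALUES (?, 'charge', ?, ?)",
                (event["user_id"], -event["cost_cents"], balance_after),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate request_id: the transaction rolled back, nothing was charged twice.
        return False
```

Delivering the same event twice leaves one requests row, one ledger entry and one balance deduction. In PostgreSQL the same idea is usually written with `INSERT ... ON CONFLICT DO NOTHING`.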

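6. Customer portal (Next.js)

Pages:

Home + pricing
Docs (OpenAI protocol compatible)
Email login (Magic Link)
Dashboard overview
Credential management
Usage statistics (tables + charts)
Billing + top-up
Model list + pricing

Key points:

NextAuth for email login
Supabase or self-hosted PG for user storage
API calls go straight to the Gateway, not through Next.js
Deploy statically on Vercel or on your own Nginx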

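7. Payment integration

Mainland China: Ping++ for WeChat Pay / Alipay

Webhook reports success → set PG topups status = success → credit the balance

Overseas: Stripe Checkout + Webhook

Invoices:

China: integrate the Nuonuo (诺诺发票) invoicing API
Overseas: Stripe's automatic PDF invoices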

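8. Monitoring + alerting

Prometheus scrapes:

Gateway: request count, latency, error rate, routing distribution
vLLM instances: the /metrics endpoint
Hosts: CPU / memory / GPU / disk / network

Grafana dashboards:

Real-time QPS
P50/P99 latency
Monthly throughput per model
Monthly revenue per model
GPU utilization heatmap
Customer ranking

Alerts to Feishu:

Backend instance down
Latency anomalies
Error rate > 5%
Abnormal balance swings (abuse prevention)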
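
The "error rate > 5%" condition can be made concrete with a sliding-window check. In this stack the rule would more likely live in Prometheus alerting than in application code, so this sketch only illustrates the arithmetic. The 5-minute window and the minimum sample size are assumptions that the document does not specify.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error share of recent requests exceeds `threshold`."""

    def __init__(self, threshold: float = 0.05, window_sec: float = 300.0,
                 min_requests: int = 20):
        self.threshold = threshold
        self.window_sec = window_sec
        self.min_requests = min_requests  # avoid firing on a handful of requests
        self._events = deque()            # (timestamp, is_error), oldest first

    def record(self, now: float, is_error: bool) -> None:
        self._events.append((now, is_error))

    def firing(self, now: float) -> bool:
        # Drop everything that has fallen out of the window.
        cutoff = now - self.window_sec
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

The comparison is strict, so exactly 5% errors does not fire while anything above it does. The minimum sample size keeps a single failed request on a quiet night from paging anyone.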

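9. Deployment checklist (8 weeks)

W1 (foundation):

Register the domain, complete ICP filing, onboard to Cloudflare
One VPS, or M3 doubling up, runs Nginx + PG + Redis + Grafana
Install Node Exporter + DCGM Exporter on every machine

W2 (models go live):

Deploy vLLM on M1-M3
Manually test each model endpoint

W3 (gateway):

Stand up the gateway with FastAPI
Implement authentication + routing + prepay
Run it as a systemd service

W4 (database):

Create the PG tables
Redis caching strategy
Settle directly in PG (add a message queue in phase 2)

W5 (portal):

Stand up a simple Next.js portal
Sign-up/login + credentials + billing
Deploy on Vercel or Nginx

W6 (payments):

Integrate Ping++ or Stripe
Top-up + invoices

W7 (closed beta):

5-10 seed users
Fix bugs, polish the experience

W8 (launch):

Open registration to everyone
Give away 1 million tokens to attract new users
Content marketing + technical blog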

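10. Key decisions & tradeoffs

Why FastAPI instead of APISIX/Kong:

Quick to pick up, readable code
Python ecosystem matches vLLM
Consider migrating once scale grows (>1000 QPS)

Why not K8s:

K8s is too heavy at 40 GPUs
systemd + docker + Nginx is stable and simple
Move to K8s in phase 2 (100+ GPUs)

Why not Kafka:

At 40 GPUs there are only a few thousand orders per day
PG or Redis Streams is enough
Revisit in phase 2

Why Cloudflare:

Free DDoS protection
Global CDN
Part of rate limiting is free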

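11. My recommendations

Don't build everything at once:

W1-W4: build only the minimal version that can take money
W5-W6: add the portal to improve the experience
W7-W8: full launch

Don't optimize prematurely:

40 GPUs = perhaps 100-1000 requests per day; no need for K8s / Kafka / ClickHouse
A single VPS running gateway + PG + Redis is entirely sufficient
Split things up once revenue passes one million

Key: data integrity:

Every request must be persisted (PG requests table)
Customers can look up history when reconciling
If there is a complaint, you can produce evidence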
