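In 2026, AI that isn't in the cloud might as well not be deployed
Models are trained in the lab and served from the cloud: this is the iron rule of AI engineering.
A harsh reality: your 7B model runs fast on a laptop, yet in production it falls over because of poor GPU scheduling, misconfigured containers, or autoscaling that never kicks in. Deploying AI takes more than writing a Dockerfile.
The cloud-native AI deployment landscape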
┌──────────────────────────────────────────────────────────────────────────────┐
│ Cloud-native AI deployment: reference architecture                           │
├────────────────┬────────────────┬────────────────┬───────────────────────────┤
│ Model serving  │ Vector layer   │ Microservices  │ Infrastructure            │
│                │                │                │                           │
│ vLLM/Triton    │ Milvus cluster │ Spring Boot    │ K8s + GPU Operator        │
│ TGI/Ollama     │ Elasticsearch  │ AI Gateway     │ NVIDIA MIG + time-slicing │
│ Model routing  │ Embedding      │ Ingress        │ Prometheus + Grafana      │
│ HPA autoscaling│                │ Service Mesh   │ ArgoCD + Kustomize        │
├────────────────┴────────────────┴────────────────┴───────────────────────────┤
│ Docker containerization + K8s orchestration                                  │
│ Multi-stage builds · GPU-aware scheduling · Autoscaling · GitOps delivery    │
└──────────────────────────────────────────────────────────────────────────────┘
Three major challenges of AI deployment
Challenge 1: GPU scarcity, where compute is hard currency
┌────────────────────────────────────────────────────┐
│ GPU supply vs. demand                              │
│                                                    │
│ Demand: 100 inference services × 1 GPU = 100 GPUs  │
│ Supply: the cluster has 8 × A100 = 8 GPUs          │
│ Gap:    92 GPUs ❌                                 │
│                                                    │
│ ┌──────┐  MIG partition  ┌────────────────────┐    │
│ │ A100 │ ──────────────→ │ 7 × MIG instances  │    │
│ │ 80GB │  time-slicing   │ + time sharing     │    │
│ └──────┘ ──────────────→ │ = 14+ logical GPUs │    │
│                          └────────────────────┘    │
└────────────────────────────────────────────────────┘
| GPU model | Memory | Max MIG instances | Suitable model size | Cost per card (monthly) |
| --- | --- | --- | --- | --- |
| A100 80GB | 80GB | 7 × 10GB | 7B-13B | ¥15,000 |
| A100 40GB | 40GB | 7 × 5GB | 7B | ¥10,000 |
| H100 80GB | 80GB | 7 × 10GB | 7B-70B | ¥25,000 |
| L40S 48GB | 48GB | No MIG support | 7B-13B | ¥6,000 |
| T4 16GB | 16GB | No MIG support | 1B-3B | ¥2,000 |
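The supply arithmetic above can be sketched in a few lines. This is a rough capacity estimate rather than a scheduler model: the name `logical_gpus` and its parameters are illustrative, and the sketch assumes every card is partitioned the same way and that time-slicing replicas stack cleanly on top of MIG instances.

```python
def logical_gpus(physical_gpus: int, mig_per_gpu: int = 1, time_slices: int = 1) -> int:
    """Number of schedulable GPU resources after MIG partitioning and time-slicing.

    Hypothetical helper: each physical GPU is split into `mig_per_gpu` MIG
    instances, and each instance is advertised `time_slices` times.
    """
    if min(physical_gpus, mig_per_gpu, time_slices) < 1:
        raise ValueError("all arguments must be >= 1")
    return physical_gpus * mig_per_gpu * time_slices


# One A100 split into 7 MIG instances, each shared by 2 workloads.
one_card = logical_gpus(1, mig_per_gpu=7, time_slices=2)

# The 8-card cluster from the diagram, measured against 100 services.
cluster = logical_gpus(8, mig_per_gpu=7, time_slices=2)
shortfall = max(0, 100 - cluster)

print(one_card, cluster, shortfall)
```

Keep in mind that a logical GPU is not a full GPU: each MIG instance gets only a fraction of the card's memory and compute, so this kind of packing suits small or quantized models.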
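Challenge 2: exploding model size, a storage and transfer nightmare
| Model | Parameters | FP16 size | INT8 size | INT4 size |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B | 7B | 14GB | 7GB | 3.5GB |
| Llama-2-13B | 13B | 26GB | 13GB | 6.5GB |
| Qwen2.5-72B | 72B | 144GB | 72GB | 36GB |
| DeepSeek-V3-671B | 671B | 1.3TB | 671GB | 335GB |
Model pull time: pulling a 72B model in FP16 from a Docker registry takes 30 minutes or more; after INT4 quantization it takes only about 8 minutes.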
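The sizes in the table follow from parameters × bits per weight, and the pull times follow from size ÷ throughput. A minimal sketch, assuming a sustained registry throughput of about 80 MB/s (a value chosen here only because it reproduces the figures quoted above; measure your own registry) and ignoring quantization metadata and non-weight files:

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_size_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[precision] / 8


def pull_minutes(size_gb: float, throughput_mb_s: float = 80.0) -> float:
    """Transfer time in minutes at an assumed sustained throughput."""
    return size_gb * 1000 / throughput_mb_s / 60


for precision in ("fp16", "int8", "int4"):
    size = weight_size_gb(72, precision)
    print(f"72B {precision}: {size:.0f} GB, about {pull_minutes(size):.1f} min to pull")
```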
Challenge 3: latency versus throughput, where you cannot have both
┌────────────────────────────────────────────────────────┐
│ Latency vs. throughput trade-off                       │
│                                                        │
│ Throughput │                      ╱                    │
│ (req/s)    │                   ╱  ← bigger batches     │
│            │                ╱     raise throughput     │
│            │             ╱                             │
│            │          ╱  ← but latency grows too       │
│            │       ╱                                   │
│            │    ╱                                      │
│            └────────────────────────── Latency (ms)    │
│                                                        │
│ Best strategy: dynamic batching + continuous batching  │
└────────────────────────────────────────────────────────┘
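| Strategy | Latency | Throughput | Use case |
| --- | --- | --- | --- |
| Serial, one request at a time | Lowest | Lowest | Real-time chat |
| Static batching | Medium | Medium | Offline inference |
| Continuous Batching | Low | High | Online serving |
| Speculative Decoding | Lower | High | Real-time + high throughput |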
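Why continuous batching beats static batching shows up even in a toy simulation. The sketch below is a deliberately simplified model, not how vLLM or TGI schedule internally: each request needs a fixed number of decode steps, one step advances every running request by one token, and prefill cost is ignored. Function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request at once."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now; the rest keep decoding.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Two long requests mixed with six short ones, four slots available.
output_lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(output_lengths, 4))      # 16
print(continuous_batching_steps(output_lengths, 4))  # 10
```

With static batching the short requests wait for the longest one in their batch; with continuous batching their slots are reused immediately, so the same work finishes in fewer steps.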
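Serving AI models: four frameworks compared
vLLM vs Triton vs TGI vs Ollama
| Dimension | vLLM | Triton Inference Server | TGI (Text Generation Inference) | Ollama |
| --- | --- | --- | --- | --- |
| Developed by | UC Berkeley | NVIDIA | HuggingFace | Community |
| Core strength | PagedAttention | Multi-framework support | Flash Attention | Minimal-effort deployment |
| Continuous Batching | ✅ Native | ✅ Dynamic batching | ✅ Native | ❌ |
| Tensor Parallel | ✅ | ✅ | ✅ | ❌ |
| Streaming output | ✅ SSE | ✅ | ✅ SSE | ✅ |
| OpenAI-compatible API | ✅ | Needs an adapter | ✅ | ✅ |
| Quantization support | AWQ/GPTQ/GGUF | INT8/FP8 | AWQ/GPTQ/bitsandbytes | GGUF |
| GPU utilization | 90%+ | 80%+ | 85%+ | 60%+ |
| K8s friendliness | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Production ready | ✅ | ✅ | ✅ | ⚠️ Dev/test |
| Multi-model serving | One model per instance | Many models per instance | One model per instance | Many models per instance |
| Model hot loading | ❌ | ✅ | ❌ | ✅ |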
Architecture comparison
┌─────────────────────────────────────────────────────────────────┐
│ vLLM architecture                                               │
│ ┌─────────┐    ┌──────────────┐    ┌─────────────────┐          │
│ │ Request │───→│ Scheduler    │───→│ PagedAttention  │          │
│ │ Queue   │    │ (Continuous  │    │ (KV Block Mgmt) │          │
│ │         │    │  Batching)   │    │                 │          │
│ └─────────┘    └──────────────┘    └─────────────────┘          │
│      ↑                ↑                     ↑                   │
│  OpenAI API     Dynamic Batch            GPU HBM                │
│  Compatible      Size Control            KV Cache               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Triton architecture                                             │
│ ┌─────────┐    ┌──────────────┐    ┌─────────────────┐          │
│ │ Model   │───→│ Dynamic      │───→│ Backend         │          │
│ │ Repo    │    │ Batcher      │    │ (PyTorch/TF/    │          │
│ │         │    │              │    │  ONNX Runtime/  │          │
│ │         │    │              │    │  TensorRT)      │          │
│ └─────────┘    └──────────────┘    └─────────────────┘          │
│      ↑                ↑                     ↑                   │
│ Multi-model        Priority          Multi-framework            │
│   config            queues          inference engines           │
└─────────────────────────────────────────────────────────────────┘
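vLLM quick start
from vllm import LLM, SamplingParams

# quantization="awq" needs AWQ weights, so the model ID points at the
# -AWQ repository rather than the FP16 checkpoint.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    tensor_parallel_size=2,        # shard the model across 2 GPUs
    gpu_memory_utilization=0.9,    # fraction of each GPU's memory vLLM may use
    max_model_len=8192,            # maximum context length in tokens
    quantization="awq",
)

params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["Explain the core challenges of cloud-native AI deployment"], params)
for output in outputs:
    print(output.outputs[0].text)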
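TGI launch example
# --quantize awq expects AWQ weights, hence the -AWQ model repository;
# --shm-size gives the container shared memory for multi-GPU communication.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 \
    --model-id Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantize awq \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --cuda-memory-fraction 0.9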
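Selection guidance
┌────────────────────────────────────────────────────────┐
│ Model-serving framework decision tree                  │
│                                                        │
│ Need maximum GPU utilization?                          │
│ ├─ Yes → vLLM (PagedAttention, 90%+ utilization)       │
│ └─ No ↓                                                │
│ Need several frameworks or models side by side?        │
│ ├─ Yes → Triton (many models per instance)             │
│ └─ No ↓                                                │
│ Need seamless Hugging Face ecosystem integration?      │
│ ├─ Yes → TGI (Flash Attention + HF Hub)                │
│ └─ No ↓                                                │
│ Just local development or testing?                     │
│ └─ Yes → Ollama (one command to start)                 │
└────────────────────────────────────────────────────────┘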
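The decision tree reads naturally as a chain of early returns. The sketch below is only a transcription of the tree above; the function name and flag names are made up for illustration, and a real selection would also weigh team skills, hardware, and licensing.

```python
from typing import Optional


def choose_framework(
    max_gpu_utilization: bool = False,
    multi_framework: bool = False,
    hf_ecosystem: bool = False,
    local_only: bool = False,
) -> Optional[str]:
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if max_gpu_utilization:
        return "vLLM"
    if multi_framework:
        return "Triton"
    if hf_ecosystem:
        return "TGI"
    if local_only:
        return "Ollama"
    return None  # the tree gives no recommendation for this combination


print(choose_framework(multi_framework=True, hf_ecosystem=True))  # Triton
```

Because the questions are ordered, an earlier "yes" takes priority: a team that needs both multi-framework serving and Hugging Face integration lands on Triton.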
Dockerizing AI applications: multi-stage builds + Compose
Multi-stage Dockerfile
# ===== Stage 1: model download =====
FROM python:3.11-slim AS model-downloader
RUN pip install --no-cache-dir huggingface-hub
ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ARG QUANTIZATION=awq
# Download the whole repository, weights included: the runtime stage serves
# the model from /models, so excluding *.safetensors would leave nothing to load.
RUN huggingface-cli download \
        ${MODEL_ID} \
        --local-dir /models/${MODEL_ID} \
    && echo "Model downloaded"

# ===== Stage 2: inference runtime =====
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# curl is required by the HEALTHCHECK below
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3-pip python3.11-venv curl \
    && rm -rf /var/lib/apt/lists/*
RUN python3.11 -m venv /opt/vllm
ENV PATH="/opt/vllm/bin:${PATH}"
# Quote the version specifier so the shell does not read ">" as a redirect
RUN pip install --no-cache-dir \
        vllm==0.8.0 \
        "transformers>=4.45.0" \
        accelerate
COPY --from=model-downloader /models /models
ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ENV MODEL_ID=${MODEL_ID}
EXPOSE 8000
# Model loading takes a while, so allow a start period before counting failures
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/Qwen/Qwen2.5-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "8192"]
Docker Compose for local development
version: "3.8"
services:
  vllm-server:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        MODEL_ID: Qwen/Qwen2.5-7B-Instruct
        QUANTIZATION: awq
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - model-cache:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  milvus-standalone:
    image: milvusdb/milvus:v2.4.17
    # The image needs an explicit run mode; embedded etcd needs a data directory
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
    environment:
      - ETCD_USE_EMBED=true
      - ETCD_DATA_DIR=/var/lib/milvus/etcd
      - COMMON_STORAGETYPE=local
    volumes:
      - milvus-data:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 15s
      timeout: 5s
      retries: 3

  elasticsearch:
    image: elasticsearch:8.15.0
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # local development only
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health"]
      interval: 15s
      timeout: 5s
      retries: 3

  springboot-ai:
    build:
      context: ../springboot-ai-service
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      # Spring's env-var binding uses underscores, not hyphens
      - SPRING_AI_OPENAI_BASE_URL=http://vllm-server:8000/v1
      - SPRING_AI_OPENAI_API_KEY=not-needed
      - MILVUS_HOST=milvus-standalone
      - MILVUS_PORT=19530
      - ELASTICSEARCH_URIS=http://elasticsearch:9200
    depends_on:
      vllm-server:
        condition: service_healthy
      milvus-standalone:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy

volumes:
  model-cache:
  milvus-data:
  es-data:
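Docker image optimization tips
| Technique | Effect | Example |
| --- | --- | --- |
| Multi-stage builds | Image size down 60%+ | Build tools stay out of the runtime image |
| Model quantization | Size down 50%-75% | FP16→INT4 |
| .dockerignore | Smaller build context | Exclude .git, __pycache__ |
| Layer cache reuse | Build time down 80% | Unchanged dependency layers are reused |
| Base image choice | Up to 10x size difference | slim vs full |
| Model on an external volume | Image carries no model data | Mounted at runtime |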
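K8s GPU scheduling: NVIDIA GPU Operator + MIG + time-slicing
Installing the NVIDIA GPU Operator
# HelmChart is the helm-controller CRD shipped with k3s/RKE2; on other
# distributions, install the same chart with `helm install` and these values.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |
    driver:
      enabled: true
      version: "550.54.15"
    toolkit:
      enabled: true
      version: "v1.15.0"
    mig:
      strategy: mixed    # exposes per-profile resources such as nvidia.com/mig-1g.10gb
    devicePlugin:
      enabled: true
      config:
        # Time-slicing settings live in the time-slicing-config ConfigMap,
        # not inline in the chart values; `default` names the key to apply.
        name: time-slicing-config
        default: a100-80gb.yaml
    migManager:
      enabled: true
      config:
        name: mig-partition-config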
MIG partition configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-partition-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # A100/H100 80GB: seven 1g.10gb instances per GPU
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # A100 40GB: a GPU exposes at most seven 1g.5gb instances
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      # GPU 0 partitioned (2×2 + 3×1 = 7 compute slices), GPU 1 left whole
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 2
            "1g.10gb": 3
        - devices: [1]
          mig-enabled: false
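A MIG layout is only valid if the requested instances fit on the card. As a rough pre-flight check, the sketch below reads the compute-slice count from the leading number of each profile name (the `2` in `2g.20gb`) and verifies that the total stays within the seven compute slices an A100 or H100 exposes. It is a simplification: real placement rules also constrain memory slices and instance positions, and the helper names are illustrative.

```python
import re

MAX_COMPUTE_SLICES = 7  # A100 and H100 expose at most 7 compute slices per GPU


def compute_slices(profile: str) -> int:
    """Extract the compute-slice count from a profile name such as '2g.20gb'."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1))


def fits_on_one_gpu(mig_devices: dict) -> bool:
    """True if the requested instances need no more than 7 compute slices."""
    used = sum(compute_slices(profile) * count for profile, count in mig_devices.items())
    return used <= MAX_COMPUTE_SLICES


print(fits_on_one_gpu({"1g.10gb": 7}))                # True
print(fits_on_one_gpu({"2g.20gb": 2, "1g.10gb": 3}))  # True: 4 + 3 = 7
print(fits_on_one_gpu({"1g.5gb": 14}))                # False: 14 slices requested
```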
GPU time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-80gb.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
            devices: all
---
# Only needed when the device plugin runs standalone. With the GPU Operator,
# the operator manages this DaemonSet and reads the ConfigMap through
# devicePlugin.config in the chart values.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds   # must match the selector above
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
          args:
            # The file name is the ConfigMap key
            - --config-file=/etc/nvidia-plugin/a100-80gb.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/nvidia-plugin
      volumes:
        - name: config
          configMap:
            name: time-slicing-config
A Pod that uses a MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-pod
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.8.0
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 10GB MIG slice instead of a whole GPU
      command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - Qwen/Qwen2.5-7B-Instruct-AWQ
        - --max-model-len
        - "4096"
        - --gpu-memory-utilization
        - "0.9"
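GPU scheduling strategies compared
| Strategy | Mechanism | Pros | Cons | Use case |
| --- | --- | --- | --- | --- |
| Dedicated GPU | 1 Pod = 1 GPU | No interference, stable performance | Wasted capacity | Large-model inference |
| MIG partitioning | Hardware-level isolation | True isolation, mixed instance sizes | Only on MIG-capable GPUs (for example A100, H100) | Many small models |
| Time-slicing | Software-level multiplexing | Works on every GPU | Context-switch overhead | Low-priority jobs |
| MPS | Shared GPU context | Low-overhead sharing | No isolation; one crash affects every tenant | Multiple jobs from one team |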
Model inference deployment: vLLM + HPA autoscaling
vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b
  namespace: ai-inference
  labels:
    app: vllm-qwen7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b
  template:
    metadata:
      labels:
        app: vllm-qwen7b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          # Prefer spreading replicas across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: vllm-qwen7b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          ports:
            - name: http            # named so a PodMonitor can select it
              containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
            requests:
              nvidia.com/gpu: 2
              cpu: "4"
              memory: 16Gi
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-7B-Instruct-AWQ"
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - Qwen/Qwen2.5-7B-Instruct-AWQ
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "2"
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "8192"
            - --enable-prefix-caching
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120   # model loading is slow
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm                    # shared memory for tensor-parallel workers
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      nodeSelector:
        gpu-type: a100-80gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
HPA autoscaling
# The Pods metrics below are custom metrics: they require a metrics adapter
# such as Prometheus Adapter, configured to expose vLLM's vllm:* series under
# these underscore names (the renaming is part of the adapter rules).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-qwen7b-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "10"
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "700m"   # the metric is a 0-1 fraction, so 0.7 means 70%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
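How the HPA turns these targets into a replica count follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), evaluated per metric, with the largest proposal winning. The sketch below shows just that core calculation; it leaves out the controller's tolerance band, stabilization windows, and the scale-up and scale-down policies, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """Core HPA calculation.

    `metrics` is a list of (current_average, target_average) pairs.
    Each metric proposes ceil(current_replicas * current / target);
    the largest proposal is clamped to [min_replicas, max_replicas].
    """
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 2 replicas, 18 running requests per pod against a target of 10,
# KV cache at 0.5 against a target of 0.7.
print(desired_replicas(2, [(18, 10), (0.5, 0.7)], min_replicas=2, max_replicas=8))  # 4
```

The request-count metric proposes 4 replicas and the cache metric proposes 2, so the HPA would scale to 4.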
Service + PodMonitor
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen7b-svc
  namespace: ai-inference
spec:
  selector:
    app: vllm-qwen7b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-qwen7b-monitor
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vllm-qwen7b
  podMetricsEndpoints:
    - port: http       # refers to the container port name on the Pod
      path: /metrics
      interval: 15s
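Key vLLM metrics to monitor
| Metric | Meaning | Alert threshold |
| --- | --- | --- |
| vllm:num_requests_running | Requests currently being processed | > 80% of max_num_seqs |
| vllm:num_requests_waiting | Requests waiting in the queue | > 10 for 5 minutes |
| vllm:gpu_cache_usage_perc | KV cache usage (0-1 fraction) | > 0.9 (90%) |
| vllm:avg_generation_throughput_toks_per_s | Average generation throughput | < 100 tok/s |
| vllm:e2e_request_latency_seconds | End-to-end request latency | P99 > 10s |
| vllm:num_preemptions_total | Number of preemptions | > 0 |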
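To make the table concrete, here is a small sketch that parses Prometheus text-format output and flags threshold breaches. The sample payload is invented for illustration (real output from `/metrics` carries more labels and many more series), the thresholds are the ones from the table, and the parser assumes samples without trailing timestamps.

```python
def parse_metrics(text: str, prefix: str = "vllm:") -> dict:
    """Sum samples per metric name, ignoring comments and other prefixes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith(prefix):
            values[name] = values.get(name, 0.0) + float(value)
    return values


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 12.0
vllm:num_requests_waiting{model_name="qwen"} 3.0
vllm:gpu_cache_usage_perc{model_name="qwen"} 0.93
process_cpu_seconds_total 41.2
"""

THRESHOLDS = {
    "vllm:num_requests_waiting": 10.0,
    "vllm:gpu_cache_usage_perc": 0.9,
}

parsed = parse_metrics(SAMPLE)
breaches = [name for name, limit in THRESHOLDS.items() if parsed.get(name, 0.0) > limit]
print(breaches)  # ['vllm:gpu_cache_usage_perc']
```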
Deploying RAG vector services on K8s: Milvus cluster + Elasticsearch
Overall architecture
┌──────────────────────────────────────────────────────────────────┐
│ RAG vector service architecture                                  │
│                                                                  │
│ ┌────────────┐    ┌──────────────┐    ┌──────────────────┐       │
│ │ User query │───→│ Spring Boot  │───→│ vLLM Embedding   │       │
│ │            │    │ AI Gateway   │    │ (vectorization)  │       │
│ └────────────┘    └──────┬───────┘    └──────────────────┘       │
│                          │                                       │
│               ┌──────────┴──────────┐                            │
│               ▼                     ▼                            │
│        ┌──────────────┐     ┌───────────────┐                    │
│        │ Milvus       │     │ Elasticsearch │                    │
│        │ (vector      │     │ (full-text    │                    │
│        │  search)     │     │  search)      │                    │
│        │ HNSW/IVF     │     │ BM25 + dense  │                    │
│        └──────┬───────┘     └───────┬───────┘                    │
│               └──────────┬──────────┘                            │
│                          ▼                                       │
│                  ┌───────────────┐    ┌──────────────────┐       │
│                  │ Hybrid results│───→│ vLLM LLM         │       │
│                  │ + rerank      │    │ (answer          │       │
│                  │               │    │  generation)     │       │
│                  └───────────────┘    └──────────────────┘       │
└──────────────────────────────────────────────────────────────────┘
Milvus cluster deployment
# Simplified sketch. A production Milvus cluster is normally installed with
# the official Helm chart or the Milvus Operator, which run the proxy, query,
# data, and index components as separate workloads. This manifest also
# expects etcd, MinIO, and a milvus-headless Service to exist already.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: milvus
  namespace: ai-rag
spec:
  serviceName: milvus-headless
  replicas: 3
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.17
          ports:
            - containerPort: 19530
            - containerPort: 9091
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
          env:
            - name: ETCD_ENDPOINTS
              value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
            - name: MINIO_ADDRESS
              value: "minio:9000"
            - name: COMMON_STORAGETYPE
              value: "remote"
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 15
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: milvus-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-svc
  namespace: ai-rag
spec:
  selector:
    app: milvus
  ports:
    - name: grpc
      port: 19530
      targetPort: 19530
    - name: metrics
      port: 9091
      targetPort: 9091
  type: ClusterIP
Elasticsearch deployment
# Expects an elasticsearch-headless Service for node discovery.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: ai-rag
spec:
  serviceName: elasticsearch-headless
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        - name: sysctl               # Elasticsearch needs a higher mmap count
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: elasticsearch:8.15.0
          ports:
            - containerPort: 9200
          resources:
            requests:
              cpu: "2"
              memory: 8Gi            # keep the 4g heap at about half of container memory
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: cluster.name
              value: "ai-rag-cluster"
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch-headless,elasticsearch-1.elasticsearch-headless,elasticsearch-2.elasticsearch-headless"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: xpack.security.enabled
              value: "false"         # enable security before exposing this cluster
            - name: ES_JAVA_OPTS
              value: "-Xms4g -Xmx4g"
          volumeMounts:
            - name: es-data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: es-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi
Creating the Milvus collection
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://milvus-svc.ai-rag.svc:19530")

# Schema: auto-generated primary key, text chunk, 1024-dim embedding, metadata
schema = client.create_schema(auto_id=True, enable_dynamic_field=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="timestamp", datatype=DataType.INT64)

# HNSW index on the vector field, inverted index on the text field
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 32, "efConstruction": 256},
)
index_params.add_index(
    field_name="text",
    index_type="INVERTED",
)

client.create_collection(
    collection_name="knowledge_base",
    schema=schema,
    index_params=index_params,
)
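Vector search vs full-text search vs hybrid search
| Retrieval method | Mechanism | Strength | Weakness | Use case |
| --- | --- | --- | --- | --- |
| Vector search | Semantic similarity | Understands meaning | Weak at exact matching | Semantic Q&A |
| Full-text search | BM25 keywords | Exact matching | No semantic understanding | Keyword search |
| Hybrid search | Fuses vector + full-text | Balances semantics and precision | Higher compute cost | Production RAG |
| Rerank | Cross-encoder reranking | Highest precision | Slow | High-precision scenarios |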
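The table does not say how the vector and full-text results are fused. One common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as this stack's actual method: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the conventional default, and the document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting
    at 1. Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]    # e.g. from Milvus, best first
keyword_hits = ["d3", "d1", "d4"]   # e.g. from Elasticsearch BM25, best first

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents found by both retrievers rise to the top, and the fused list can then be passed to a reranker when the extra precision is worth the latency.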
Moving Spring Boot AI microservices to the cloud
Dockerfile for the Spring Boot AI application
# ===== Build stage =====
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY gradle/ gradle/
COPY gradlew build.gradle settings.gradle ./
# Resolve dependencies first so this layer is cached across source changes
RUN ./gradlew dependencies --no-daemon
COPY src/ src/
RUN ./gradlew bootJar --no-daemon -x test

# ===== Runtime stage =====
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar
ENV JAVA_OPTS="-XX:+UseZGC -XX:MaxRAMPercentage=75.0"
ENV SPRING_PROFILES_ACTIVE=k8s
EXPOSE 8080
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -qO- http://localhost:8080/actuator/health || exit 1
# exec replaces the shell so the JVM receives SIGTERM and can shut down gracefully
ENTRYPOINT ["sh", "-c", "exec java ${JAVA_OPTS} -jar app.jar"]
K8s Deployment + Service + Ingress
apiVersion: apps/v1
kind: Deployment
metadata:
  name: springboot-ai-service
  namespace: ai-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: springboot-ai-service
  template:
    metadata:
      labels:
        app: springboot-ai-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
        - name: app
          image: registry.example.com/springboot-ai-service:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          env:
            # Spring's env-var binding uses underscores, not hyphens
            - name: SPRING_AI_OPENAI_BASE_URL
              value: "http://vllm-qwen7b-svc.ai-inference.svc:8000/v1"
            - name: SPRING_AI_OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-api-keys
                  key: openai-key
            - name: MILVUS_HOST
              value: "milvus-svc.ai-rag.svc"
            - name: MILVUS_PORT
              value: "19530"
            - name: ELASTICSEARCH_URIS
              value: "http://elasticsearch-svc.ai-rag.svc:9200"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config
      volumes:
        - name: config
          configMap:
            name: springboot-ai-config
---
apiVersion: v1
kind: Service
metadata:
  name: springboot-ai-svc
  namespace: ai-app
spec:
  selector:
    app: springboot-ai-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: springboot-ai-ingress
  namespace: ai-app
  annotations:
    # Long timeouts so streamed LLM responses are not cut off
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    # Snippet annotations only work if the ingress controller allows them
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-AI-Service: springboot-ai";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ai-api.example.com
      secretName: ai-api-tls
  rules:
    - host: ai-api.example.com
      http:
        paths:
          - path: /api/ai
            pathType: Prefix
            backend:
              service:
                name: springboot-ai-svc
                port:
                  number: 80
Core Spring Boot AI configuration
spring:
  ai:
    openai:
      base-url: http://vllm-qwen7b-svc.ai-inference.svc:8000/v1
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: Qwen/Qwen2.5-7B-Instruct-AWQ
          temperature: 0.7
          max-tokens: 2048
    vectorstore:
      milvus:
        client:
          host: ${MILVUS_HOST:milvus-svc.ai-rag.svc}
          port: ${MILVUS_PORT:19530}
        database-name: default
        collection-name: knowledge_base
        embedding-dimension: 1024
        metric-type: COSINE
        index-type: HNSW
  elasticsearch:
    uris: ${ELASTICSEARCH_URIS:http://elasticsearch-svc.ai-rag.svc:9200}
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: springboot-ai-service
  tracing:
    sampling:
      probability: 1.0   # samples every request; lower this under heavy traffic
  otlp:
    metrics:
      export:
        url: http://otel-collector.observability.svc:4318/v1/metrics
    tracing:
      endpoint: http://otel-collector.observability.svc:4318/v1/traces
Observability: Prometheus + Grafana + OpenTelemetry
Monitoring architecture
┌──────────────────────────────────────────────────────────────────┐
│ Full-stack observability architecture                            │
│                                                                  │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────────┐     │
│ │ vLLM     │   │ Milvus   │   │ ES       │   │ Spring Boot │     │
│ │ /metrics │   │ /metrics │   │ exporter │   │ /actuator   │     │
│ └────┬─────┘   └────┬─────┘   └────┬─────┘   └──────┬──────┘     │
│      └──────────────┴───────┬──────┴────────────────┘            │
│                   ┌─────────▼───────────┐                        │
│                   │ OTel Collector      │                        │
│                   │ (receive, process,  │                        │
│                   │  export)            │                        │
│                   └─────────┬───────────┘                        │
│            ┌────────────────┼───────────────┐                    │
│            ▼                ▼               ▼                    │
│    ┌────────────────┐ ┌───────────┐   ┌──────────┐               │
│    │ Prometheus     │ │ Jaeger    │   │ Loki     │               │
│    │ (metrics store)│ │ (tracing) │   │ (logs)   │               │
│    └───────┬────────┘ └───────────┘   └──────────┘               │
│            ▼                                                     │
│    ┌────────────────┐                                            │
│    │ Grafana        │                                            │
│    │ (dashboards)   │                                            │
│    └────────────────┘                                            │
└──────────────────────────────────────────────────────────────────┘
Deploying the OpenTelemetry Collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          ports:
            - containerPort: 4318   # OTLP over HTTP
            - containerPort: 4317   # OTLP over gRPC
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
      prometheus:
        config:
          scrape_configs:
            - job_name: "vllm"
              scrape_interval: 15s
              static_configs:
                - targets: ["vllm-qwen7b-svc.ai-inference.svc:8000"]
            - job_name: "milvus"
              scrape_interval: 15s
              static_configs:
                - targets: ["milvus-svc.ai-rag.svc:9091"]
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400
    exporters:
      # Prometheus must run with its remote-write receiver enabled
      prometheusremotewrite:
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
      otlphttp:
        endpoint: http://jaeger-collector.observability.svc:4318
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlphttp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Prometheus alerting rules
# vLLM exports its metrics with a "vllm:" prefix. If your pipeline rewrites
# the colon to an underscore, adjust the expressions accordingly.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: observability
spec:
  groups:
    - name: vllm-alerts
      rules:
        - alert: VLLMHighQueueDepth
          expr: vllm:num_requests_waiting > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue is too deep"
            description: "Instance {{ $labels.instance }} has {{ $value }} requests waiting"
        - alert: VLLMHighLatency
          expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "vLLM P99 latency is too high"
            description: "Instance {{ $labels.instance }} P99 latency is {{ $value }}s"
        - alert: VLLMGPUCacheFull
          expr: vllm:gpu_cache_usage_perc > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM GPU KV cache usage is too high"
            # The metric is a 0-1 fraction, so format it as a percentage
            description: "Instance {{ $labels.instance }} cache usage is {{ $value | humanizePercentage }}"
        - alert: VLLMPreemptions
          expr: increase(vllm:num_preemptions_total[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "vLLM is preempting requests"
            description: "Instance {{ $labels.instance }} preempted {{ $value }} requests in 5 minutes"
    - name: milvus-alerts
      rules:
        - alert: MilvusHighQueryLatency
          # Assumes the histogram is in milliseconds (2000 = 2s); verify for your Milvus version
          expr: histogram_quantile(0.99, rate(milvus_proxy_sq_latency_bucket[5m])) > 2000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Milvus query latency is too high"
        - alert: MilvusInsertRateLow
          expr: rate(milvus_proxy_insert_vectors_count[5m]) < 10
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Milvus insert rate is abnormally low"
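The P99 latency alert depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below reproduces that idea on plain numbers; it is a simplified illustration (no rates, no label handling), and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. The lowest bucket is assumed
    to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower
            fraction = (rank - prev_count) / (count - prev_count)
            return lower + (upper - lower) * fraction
        lower, prev_count = upper, count
    return lower


# Hypothetical latency histogram: 100 requests, bounds in seconds.
latency_buckets = [(1.0, 50), (5.0, 90), (10.0, 99), (math.inf, 100)]

print(histogram_quantile(0.5, latency_buckets))
print(histogram_quantile(0.7, latency_buckets))
print(histogram_quantile(0.99, latency_buckets))
```

Because the estimate is interpolated, its precision depends on the bucket boundaries: with a 10s threshold, it helps to have a bucket edge at or near 10s.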
Key Grafana dashboard panels
| Panel | Metric | Visualization |
| --- | --- | --- |
| Inference QPS | rate(vllm:request_success_total[1m]) | Line chart |
| P50/P90/P99 latency | histogram_quantile | Line chart |
| GPU KV cache usage | vllm:gpu_cache_usage_perc | Gauge |
| Request queue depth | vllm:num_requests_waiting | Bar chart |
| Vector search latency | milvus_proxy_sq_latency | Line chart |
| Spring Boot JVM | jvm_memory_used_bytes | Area chart |
| HTTP error rate | rate(http_server_requests_seconds_count{status=~"5.."}[5m]) | Line chart |
| Distributed tracing | Jaeger integration | Table + links |
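Cost optimization: quantization + MIG reuse + dynamic autoscaling + Spot instances
Quantization methods compared
| Method | Accuracy loss | Size reduction | Inference speedup | Compatibility |
| --- | --- | --- | --- | --- |
| FP16 (baseline) | 0% | 1x | 1x | All |
| INT8 (PTQ) | <1% | 2x | 1.5x | A100/H100 |
| INT4 (GPTQ) | 1-3% | 4x | 1.8x | Needs calibration data |
| INT4 (AWQ) | 1-2% | 4x | 2x | Needs calibration data |
| INT4 (GGUF) | 2-5% | 4x | 1.5x | CPU and GPU |
| FP8 | <0.5% | 2x | 2x+ | Native on H100 |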
Cost optimization strategies at a glance
┌──────────────────────────────────────────────────────────────────┐
│ Four levers for cost optimization                                │
│                                                                  │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ 1. Quantize  │ │ 2. MIG reuse │ │ 3. Autoscale │ │ 4. Spot   │ │
│ │              │ │              │ │              │ │           │ │
│ │ FP16 → INT4  │ │ 1 GPU → 7 MIG│ │ HPA + VPA    │ │ Bid-priced│ │
│ │ size ↓75%    │ │ util. ↑7x    │ │ scale in idle│ │ instances │ │
│ │ speed ↑2x    │ │ cost ↓85%    │ │ cost ↓40%    │ │ cost ↓70% │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│                                                                  │
│ Combined effect: cost falls to 5%-15% of the baseline            │
└──────────────────────────────────────────────────────────────────┘
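A worked cost example
| Plan | GPU configuration | Monthly cost | Notes |
| --- | --- | --- | --- |
| Baseline: FP16, dedicated | 4 × A100 80GB | ¥60,000 | 2 replicas × 2 GPUs |
| INT4 quantization | 2 × A100 80GB | ¥30,000 | 2 replicas × 1 GPU |
| INT4 + MIG | 1 × A100 80GB | ¥15,000 | 1 replica × 1 MIG instance |
| INT4 + MIG + HPA | 1 × A100 80GB (0.5 when idle) | ¥10,000 | Scales in to 1 replica when idle |
| INT4 + MIG + HPA + Spot | 1 × A100 Spot | ¥3,000 | 70% Spot discount |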
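The savings implied by the table can be recomputed directly. The figures below are the article's illustrative monthly costs, not quotes from any provider, and the plan labels are shortened for readability.

```python
BASELINE = 60_000  # FP16 on dedicated GPUs, monthly cost in yuan

PLANS = [
    ("FP16 dedicated", 60_000),
    ("INT4", 30_000),
    ("INT4 + MIG", 15_000),
    ("INT4 + MIG + HPA", 10_000),
    ("INT4 + MIG + HPA + Spot", 3_000),
]


def savings_pct(cost: float, baseline: float = BASELINE) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal."""
    return round((1 - cost / baseline) * 100, 1)


for name, cost in PLANS:
    print(f"{name:<26} ¥{cost:>6,}  saves {savings_pct(cost):>5.1f}%")

# The last step applies the 70% Spot discount to the HPA plan.
spot_cost = round(10_000 * (1 - 0.70))
print(spot_cost)  # 3000
```

The final plan costs 5% of the baseline, which matches the lower end of the combined range claimed in the overview.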
Running on Spot instances
# Uses GKE's preemptible-node label and taint as the example; other clouds
# (and GKE Spot VMs) use different keys.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b-spot
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b-spot
  template:
    metadata:
      labels:
        app: vllm-qwen7b-spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: cloud.google.com/gke-preemptible
                    operator: In
                    values: ["true"]
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Give in-flight requests time to finish when the node is reclaimed
      terminationGracePeriodSeconds: 45
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model
            - Qwen/Qwen2.5-7B-Instruct-AWQ
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          lifecycle:
            preStop:
              exec:
                # Keep serving while the endpoint is removed from load balancing
                command: ["/bin/sh", "-c", "sleep 30"]
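Loading quantized models
from vllm import LLM

# Each LLM instance claims its own share of GPU memory, so load one
# quantized variant per process rather than both at once.
QUANTIZED_MODELS = {
    "awq": dict(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",
        quantization="awq",
        enforce_eager=True,  # skip CUDA graph capture to save memory
    ),
    "gptq": dict(
        model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
        quantization="gptq",
    ),
}


def load_quantized(kind: str) -> LLM:
    """Load the AWQ or GPTQ variant with shared memory and context settings."""
    return LLM(gpu_memory_utilization=0.85, max_model_len=4096, **QUANTIZED_MODELS[kind])


llm = load_quantized("awq")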
GitOps: automated delivery with ArgoCD + Kustomize
The GitOps workflow
┌──────────────────────────────────────────────────────────────────┐
│ GitOps delivery pipeline                                         │
│                                                                  │
│ ┌───────────┐   ┌───────────┐   ┌────────────┐   ┌───────────┐   │
│ │ Developer │──→│ CI build, │──→│ Image push │──→│ Git repo: │   │
│ │ commit    │   │ tests pass│   │ to registry│   │ manifests │   │
│ │           │   │           │   │            │   │ updated   │   │
│ └───────────┘   └───────────┘   └────────────┘   └─────┬─────┘   │
│                                                        │         │
│                                               ┌────────▼───────┐ │
│                                               │ ArgoCD         │ │
│                                               │ detects change,│ │
│                                               │ auto-syncs     │ │
│                                               └────────┬───────┘ │
│                              ┌───────────────┬─────────┤         │
│                              ▼               ▼         ▼         │
│                         ┌─────────┐     ┌─────────┐ ┌──────┐     │
│                         │ Dev     │     │ Staging │ │ Prod │     │
│                         └─────────┘     └─────────┘ └──────┘     │
└──────────────────────────────────────────────────────────────────┘
Kustomize directory layout
k8s/
├── base/
│ ├── kustomization.yaml
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── hpa.yaml
│ └── ingress.yaml
├── overlays/
│ ├── dev/
│ │ ├── kustomization.yaml
│ │ └── patch-replicas.yaml
│ ├── staging/
│ │ ├── kustomization.yaml
│ │ └── patch-resources.yaml
│ └── production/
│ ├── kustomization.yaml
│ ├── patch-replicas.yaml
│ ├── patch-resources.yaml
│ └── patch-hpa.yaml
Base Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml
# `labels` replaces the deprecated commonLabels and leaves selectors untouched
labels:
  - pairs:
      app.kubernetes.io/part-of: ai-inference-platform
      app.kubernetes.io/managed-by: kustomize
images:
  # `name` must match the image name used in the manifests
  - name: vllm/vllm-openai
    newName: registry.example.com/vllm-openai
    newTag: v0.8.0
configMapGenerator:
  - name: vllm-config
    literals:
      - MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-AWQ
      - MAX_MODEL_LEN=8192
      - GPU_MEMORY_UTILIZATION=0.9
secretGenerator:
  # Placeholder only: never commit real credentials to Git
  - name: ai-api-keys
    literals:
      - openai-key=placeholder
Production Overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ai-inference-production
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: vllm-qwen7b
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-qwen7b
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: vllm
                resources:
                  limits:
                    nvidia.com/gpu: 2
                  requests:
                    nvidia.com/gpu: 2
                    cpu: "4"
                    memory: 16Gi
  - target:
      kind: HorizontalPodAutoscaler
      name: vllm-qwen7b-hpa
    patch: |
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: vllm-qwen7b-hpa
      spec:
        minReplicas: 4
        maxReplicas: 16
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-inference-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: ai-platform-deploys
spec:
  project: ai-platform
  source:
    repoURL: https://git.example.com/ai-platform/k8s-manifests.git
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference-production
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual changes made in the cluster
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m
The ArgoCD App of Apps pattern
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ai-platform/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ai-platform
  namespace: argocd
spec:
  description: AI inference platform project
  sourceRepos:
    - https://git.example.com/ai-platform/*
  destinations:
    - namespace: ai-inference-*
      server: https://kubernetes.default.svc
    - namespace: ai-rag-*
      server: https://kubernetes.default.svc
    - namespace: ai-app-*
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota
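Summary: cloud-native AI deployment checklist
| Stage | Check item | Status |
| --- | --- | --- |
| Model serving | vLLM/Triton selection decided | ☐ |
| Model serving | Quantization strategy (INT4/AWQ) decided | ☐ |
| Dockerization | Multi-stage Dockerfile written | ☐ |
| Dockerization | Verified locally with Docker Compose | ☐ |
| K8s GPU | GPU Operator installed | ☐ |
| K8s GPU | MIG/time-slicing configured | ☐ |
| Inference deployment | vLLM Deployment rolled out | ☐ |
| Inference deployment | HPA autoscaling configured | ☐ |
| RAG services | Milvus cluster deployed | ☐ |
| RAG services | Elasticsearch deployed | ☐ |
| Microservices | Spring Boot AI running in the cloud | ☐ |
| Microservices | Ingress routing configured | ☐ |
| Observability | Prometheus + Grafana | ☐ |
| Observability | OpenTelemetry distributed tracing | ☐ |
| Observability | Alerting rules configured | ☐ |
| Cost optimization | Model quantization validated | ☐ |
| Cost optimization | Spot instance strategy | ☐ |
| GitOps | ArgoCD + Kustomize | ☐ |
| GitOps | Multi-environment overlays configured | ☐ |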
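Cloud-native AI deployment is a starting point, not a finish line. From Docker containerization to K8s orchestration, from GPU scheduling to observability, each step adds to your engineering foundation. Remember: a deployment without monitoring is flying blind, and operations without automation are manual labor. Automate everything with GitOps, and make AI inference services first-class citizens of your cloud-native platform.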