雲原生AI部署全攻略：Docker+K8s+GPU調度

2026年，AI不上雲等於沒上

模型訓練在實驗室，推理服務在雲端——這是AI工程化的鐵律。

一個殘酷現實：你的7B模型在筆記本上跑得飛快，上了生產環境卻因為GPU調度不當、容器配置錯誤、擴縮容失靈而全線崩潰。AI部署不是寫個Dockerfile那麼簡單。

雲原生AI部署全景圖

┌─────────────────────────────────────────────────────────────────────┐
│                     雲原生AI部署全景架構                               │
├─────────────┬─────────────┬──────────────┬─────────────────────────┤
│  模型服務層  │  向量服務層  │  微服務層     │   基礎設施層            │
│             │             │              │                         │
│ vLLM/Triton │ Milvus集群  │ Spring Boot  │ K8s + GPU Operator     │
│ TGI/Ollama  │ Elastic-    │ AI Gateway   │ NVIDIA MIG + 時間分片   │
│ 模型路由     │ search      │ Ingress      │ Prometheus + Grafana    │
│ HPA擴縮容   │ Embedding   │ Service Mesh │ ArgoCD + Kustomize     │
├─────────────┴─────────────┴──────────────┴─────────────────────────┤
│                    Docker 容器化 + K8s 編排                          │
│              多階段建構 · GPU感知調度 · 彈性伸縮 · GitOps交付        │
└─────────────────────────────────────────────────────────────────────┘

AI部署三大挑戰

挑戰一：GPU稀缺——算力是硬通貨

┌──────────────────────────────────────────────────┐
│              GPU資源供需矛盾                        │
│                                                    │
│   需求：100個推理服務 × 1 GPU = 100 GPU            │
│   現實：集群只有 8 × A100 = 8 GPU                  │
│   缺口：92 GPU ❌                                  │
│                                                    │
│   ┌──────┐  MIG分片   ┌──────────────┐            │
│   │ A100 │ ────────→  │ 7 × MIG實例  │            │
│   │ 80GB │  時間分片   │ + 時間復用    │            │
│   └──────┘ ────────→  │ = 14+ 邏輯GPU │            │
│                        └──────────────┘            │
└──────────────────────────────────────────────────┘

GPU型號	顯存	MIG分片數	適合模型規模	單卡成本(月)
A100 80GB	80GB	7 × 10GB	7B-13B	¥15,000
A100 40GB	40GB	4 × 10GB	7B	¥10,000
H100 80GB	80GB	7 × 10GB	7B-70B	¥25,000
L40S 48GB	48GB	不支援MIG	7B-13B	¥6,000
T4 16GB	16GB	不支援MIG	1B-3B	¥2,000

挑戰二：模型體積爆炸——儲存與傳輸的噩夢

模型	參數量	FP16體積	INT8體積	INT4體積
Qwen2.5-7B	7B	14GB	7GB	3.5GB
Llama3.1-13B	13B	26GB	13GB	6.5GB
Qwen2.5-72B	72B	144GB	72GB	36GB
DeepSeek-V3-671B	671B	1.3TB	671GB	335GB

模型拉取時間：72B模型FP16從Docker Registry拉取需要30分鐘+，INT4量化後僅需8分鐘。

挑戰三：延遲與吞吐矛盾——魚與熊掌不可兼得

┌─────────────────────────────────────────────────────┐
│           延遲 vs 吞吐 權衡曲線                       │
│                                                       │
│  吞吐 │         ╱                                     │
│  (req/│        ╱                                      │
│  sec) │       ╱    ← 批處理增大，吞吐提升               │
│       │      ╱                                        │
│       │     ╱                                         │
│       │    ╱                                          │
│       │───╱────────────────────────                   │
│       │  ╱    ← 但延遲也增大！                         │
│       └─────────────────────── 延遲(ms)               │
│                                                       │
│  最優策略：動態批處理 + Continuous Batching             │
└─────────────────────────────────────────────────────┘

策略	延遲	吞吐	適用場景
單請求串行	最低	最低	即時對話
靜態批處理	中等	中等	離線推理
Continuous Batching	低	高	線上服務
Speculative Decoding	更低	高	即時+高吞吐

AI模型服務化：四大框架對比

vLLM vs Triton vs TGI vs Ollama

維度	vLLM	Triton Inference Server	TGI (Text Generation Inference)	Ollama
開發方	UC Berkeley	NVIDIA	HuggingFace	自社群
核心優勢	PagedAttention	多框架支援	Flash Attention	極簡部署
Continuous Batching	✅ 原生	✅ 動態批處理	✅ 原生	❌
Tensor Parallel	✅	✅	✅	❌
串流輸出	✅ SSE	✅	✅ SSE	✅
OpenAI相容API	✅	需適配	✅	✅
量化支援	AWQ/GPTQ/GGUF	INT8/FP8	AWQ/GPTQ/bitsandbytes	GGUF
GPU利用率	90%+	80%+	85%+	60%+
K8s友好度	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
生產就緒	✅	✅	✅	⚠️ 開發/測試
多模型服務	單實例單模型	單實例多模型	單實例單模型	單實例多模型
模型熱載入	❌	✅	❌	✅

架構對比圖

┌─────────────────────────────────────────────────────────────────┐
│                      vLLM 架構                                   │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Request  │───→│ Scheduler    │───→│ PagedAttention  │        │
│  │ Queue    │    │ (Continuous  │    │ (KV Block Mgmt) │        │
│  │          │    │  Batching)   │    │                 │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    OpenAI API    Dynamic Batch         GPU HBM                  │
│    Compatible    Size Control          KV Cache                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Triton 架構                                   │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Model    │───→│ Dynamic      │───→│ Backend         │        │
│  │ Repo     │    │ Batcher     │    │ (PyTorch/TF/    │        │
│  │          │    │              │    │  ONRT/TensorRT) │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    多模型配置    優先級佇列          多框架推理引擎              │
└─────────────────────────────────────────────────────────────────┘

vLLM快速啟動

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    quantization="awq",
)

params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["解釋雲原生AI部署的核心挑戰"], params)
for output in outputs:
    print(output.outputs[0].text)

TGI啟動範例

docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --quantize awq \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --cuda-memory-fraction 0.9

選型建議

┌──────────────────────────────────────────────────────┐
│              模型服務框架選型決策樹                      │
│                                                        │
│  需要GPU極致利用率？                                    │
│    ├─ 是 → vLLM (PagedAttention, 90%+ 利用率)         │
│    └─ 否 ↓                                             │
│  需要多框架/多模型共存？                                │
│    ├─ 是 → Triton (單實例多模型)                       │
│    └─ 否 ↓                                             │
│  需要HuggingFace生態無縫整合？                          │
│    ├─ 是 → TGI (Flash Attention + HF Hub)             │
│    └─ 否 ↓                                             │
│  只是本地開發/測試？                                    │
│    └─ 是 → Ollama (一條命令啟動)                       │
└──────────────────────────────────────────────────────┘

Docker化AI應用：多階段建構 + Compose

多階段Dockerfile

# ===== Stage 1: 模型下載 =====
FROM python:3.11-slim AS model-downloader

RUN pip install --no-cache-dir huggingface-hub

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ARG QUANTIZATION=awq

RUN huggingface-cli download \
    ${MODEL_ID} \
    --local-dir /models/${MODEL_ID} \
    --exclude "*.safetensors" \
    && echo "Model config downloaded"

# ===== Stage 2: 推理執行時 =====
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3-pip python3.11-venv \
    && rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/vllm
ENV PATH="/opt/vllm/bin:${PATH}"

RUN pip install --no-cache-dir \
    vllm==0.8.0 \
    transformers>=4.45.0 \
    accelerate

COPY --from=model-downloader /models /models

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ENV MODEL_ID=${MODEL_ID}

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/Qwen/Qwen2.5-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "8192"]

Docker Compose本地開發環境

version: "3.8"

services:
  vllm-server:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        MODEL_ID: Qwen/Qwen2.5-7B-Instruct
        QUANTIZATION: awq
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - model-cache:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  milvus-standalone:
    image: milvusdb/milvus:v2.4.17
    ports:
      - "19530:19530"
      - "9091:9091"
    environment:
      - ETCD_USE_EMBED=true
      - COMMON_STORAGETYPE=local
    volumes:
      - milvus-data:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 15s
      timeout: 5s
      retries: 3

  elasticsearch:
    image: elasticsearch:8.15.0
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health"]
      interval: 15s
      timeout: 5s
      retries: 3

  springboot-ai:
    build:
      context: ../springboot-ai-service
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - SPRING_AI_OPENAI_BASE-URL=http://vllm-server:8000/v1
      - SPRING_AI_OPENAI_API-KEY=not-needed
      - MILVUS_HOST=milvus-standalone
      - MILVUS_PORT=19530
      - ELASTICSEARCH_URIS=http://elasticsearch:9200
    depends_on:
      vllm-server:
        condition: service_healthy
      milvus-standalone:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy

volumes:
  model-cache:
  milvus-data:
  es-data:

Docker映像檔最佳化技巧

最佳化手段	效果	範例
多階段建構	映像檔體積減少60%+	編譯工具不進執行時
模型量化	體積減少50%-75%	FP16→INT4
.dockerignore	建構上下文減少	排除.git, pycache
層快取復用	建構時間減少80%	依賴層不變則復用
基礎映像檔選擇	體積差異10倍	slim vs full
模型外掛Volume	映像檔不含模型資料	執行時掛載

K8s GPU調度：NVIDIA GPU Operator + MIG + 時間分片

NVIDIA GPU Operator安裝

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |
    driver:
      enabled: true
      version: "550.54.15"
    toolkit:
      enabled: true
      version: "v1.15.0"
    devicePlugin:
      enabled: true
      config:
        name: mig-config
        shared:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4
    migManager:
      enabled: true
      config:
        name: mig-partition-config

MIG分區配置

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-partition-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 14
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 2
            "1g.10gb": 3
        - devices: [1]
          mig-enabled: false

GPU時間分片配置

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-80gb.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
            devices: all
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
          args:
            - --config-file=/etc/nvidia-plugin/config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/nvidia-plugin
      volumes:
        - name: config
          configMap:
            name: time-slicing-config

Pod使用MIG實例

apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-pod
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.8.0
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
      command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - Qwen/Qwen2.5-7B-Instruct-AWQ
        - --max-model-len
        - "4096"
        - --gpu-memory-utilization
        - "0.9"

GPU調度策略對比

策略	原理	優點	缺點	適用場景
獨佔GPU	1 Pod = 1 GPU	零干擾，效能穩定	資源浪費	大模型推理
MIG分片	硬體級隔離	真正隔離，可配不同規格	僅A100/H100支援	多小模型服務
時間分片	軟體級復用	所有GPU都支援	上下文切換開銷	低優先級任務
MPS	共享GPU上下文	低開銷共享	無隔離，一方崩潰全部影響	同團隊多任務

模型推理服務部署：vLLM + HPA自動擴縮容

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b
  namespace: ai-inference
  labels:
    app: vllm-qwen7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b
  template:
    metadata:
      labels:
        app: vllm-qwen7b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: vllm-qwen7b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
            requests:
              nvidia.com/gpu: 2
              cpu: "4"
              memory: 16Gi
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-7B-Instruct-AWQ"
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - Qwen/Qwen2.5-7B-Instruct-AWQ
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "2"
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "8192"
            - --enable-prefix-caching
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      nodeSelector:
        gpu-type: a100-80gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

HPA自動擴縮容

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-qwen7b-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "10"
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "70"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Service + PodMonitor

apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen7b-svc
  namespace: ai-inference
spec:
  selector:
    app: vllm-qwen7b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-qwen7b-monitor
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vllm-qwen7b
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 15s

vLLM關鍵監控指標

指標	含義	告警閾值
`vllm:num_requests_running`	正在處理的請求數	> 80% max_num_seqs
`vllm:num_requests_waiting`	等待佇列中的請求數	> 10 持續5分鐘
`vllm:gpu_cache_usage_perc`	KV Cache使用率	> 90%
`vllm:avg_generation_throughput`	平均生成吞吐量	< 100 tok/s
`vllm:e2e_request_latency_seconds`	端到端請求延遲	P99 > 10s
`vllm:num_preemptions`	搶佔次數	> 0

RAG向量服務K8s部署：Milvus集群 + Elasticsearch

整體架構

┌──────────────────────────────────────────────────────────────────┐
│                     RAG向量服務架構                                │
│                                                                    │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐        │
│  │  使用者查詢│───→│ Spring Boot  │───→│ vLLM Embedding   │        │
│  │          │    │ AI Gateway   │    │ (向量生成)        │        │
│  └──────────┘    └──────┬───────┘    └──────────────────┘        │
│                         │                                        │
│              ┌──────────┼──────────┐                              │
│              ▼                     ▼                              │
│     ┌──────────────┐    ┌──────────────┐                         │
│     │   Milvus     │    │Elasticsearch │                         │
│     │  (向量檢索)  │    │ (全文檢索)   │                         │
│     │  HNSW/IVF   │    │  BM25+稠密   │                         │
│     └──────────────┘    └──────────────┘                         │
│            │                    │                                │
│            └────────┬───────────┘                                │
│                     ▼                                            │
│              ┌──────────────┐    ┌──────────────────┐           │
│              │  混合檢索結果 │───→│ vLLM LLM生成     │           │
│              │  Rerank重排  │    │ (答案生成)       │           │
│              └──────────────┘    └──────────────────┘           │
└──────────────────────────────────────────────────────────────────┘

Milvus集群部署

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: milvus
  namespace: ai-rag
spec:
  serviceName: milvus-headless
  replicas: 3
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.17
          ports:
            - containerPort: 19530
            - containerPort: 9091
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
          env:
            - name: ETCD_ENDPOINTS
              value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
            - name: MINIO_ADDRESS
              value: "minio:9000"
            - name: COMMON_STORAGETYPE
              value: "remote"
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 15
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: milvus-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-svc
  namespace: ai-rag
spec:
  selector:
    app: milvus
  ports:
    - name: grpc
      port: 19530
      targetPort: 19530
    - name: metrics
      port: 9091
      targetPort: 9091
  type: ClusterIP

Elasticsearch部署

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: ai-rag
spec:
  serviceName: elasticsearch-headless
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        - name: sysctl
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: elasticsearch:8.15.0
          ports:
            - containerPort: 9200
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: cluster.name
              value: "ai-rag-cluster"
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch-headless,elasticsearch-1.elasticsearch-headless,elasticsearch-2.elasticsearch-headless"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: xpack.security.enabled
              value: "false"
            - name: ES_JAVA_OPTS
              value: "-Xms4g -Xmx4g"
          volumeMounts:
            - name: es-data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: es-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi

Milvus Collection建立

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://milvus-svc.ai-rag.svc:19530")

schema = client.create_schema(auto_id=True, enable_dynamic_field=True)

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="timestamp", datatype=DataType.INT64)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 32, "efConstruction": 256}
)
index_params.add_index(
    field_name="text",
    index_type="INVERTED"
)

client.create_collection(
    collection_name="knowledge_base",
    schema=schema,
    index_params=index_params,
)

向量檢索 vs 全文檢索 vs 混合檢索

檢索方式	原理	優勢	劣勢	適用場景
向量檢索	語意相似度	理解語意	精確匹配弱	語意問答
全文檢索	BM25關鍵詞	精確匹配	無語意理解	關鍵詞搜尋
混合檢索	向量+全文融合	兼顧語意與精確	計算開銷大	生產RAG
Rerank	交叉編碼器重排	精度最高	速度慢	高精度場景

Spring Boot AI微服務上雲

Spring Boot AI應用Dockerfile

FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY gradle/ gradle/
COPY gradlew build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon
COPY src/ src/
RUN ./gradlew bootJar --no-daemon -x test

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar

ENV JAVA_OPTS="-XX:+UseZGC -XX:MaxRAMPercentage=75.0"
ENV SPRING_PROFILES_ACTIVE=k8s

EXPOSE 8080

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -qO- http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["sh", "-c", "java ${JAVA_OPTS} -jar app.jar"]

K8s Deployment + Service + Ingress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: springboot-ai-service
  namespace: ai-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: springboot-ai-service
  template:
    metadata:
      labels:
        app: springboot-ai-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
        - name: app
          image: registry.example.com/springboot-ai-service:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          env:
            - name: SPRING_AI_OPENAI_BASE-URL
              value: "http://vllm-qwen7b-svc.ai-inference.svc:8000/v1"
            - name: SPRING_AI_OPENAI_API-KEY
              valueFrom:
                secretKeyRef:
                  name: ai-api-keys
                  key: openai-key
            - name: MILVUS_HOST
              value: "milvus-svc.ai-rag.svc"
            - name: MILVUS_PORT
              value: "19530"
            - name: ELASTICSEARCH_URIS
              value: "http://elasticsearch-svc.ai-rag.svc:9200"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config
      volumes:
        - name: config
          configMap:
            name: springboot-ai-config
---
apiVersion: v1
kind: Service
metadata:
  name: springboot-ai-svc
  namespace: ai-app
spec:
  selector:
    app: springboot-ai-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: springboot-ai-ingress
  namespace: ai-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-AI-Service: springboot-ai";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ai-api.example.com
      secretName: ai-api-tls
  rules:
    - host: ai-api.example.com
      http:
        paths:
          - path: /api/ai
            pathType: Prefix
            backend:
              service:
                name: springboot-ai-svc
                port:
                  number: 80

Spring Boot AI核心配置

spring:
  ai:
    openai:
      base-url: http://vllm-qwen7b-svc.ai-inference.svc:8000/v1
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: Qwen/Qwen2.5-7B-Instruct-AWQ
          temperature: 0.7
          max-tokens: 2048
    vectorstore:
      milvus:
        client:
          host: ${MILVUS_HOST:milvus-svc.ai-rag.svc}
          port: ${MILVUS_PORT:19530}
        database-name: default
        collection-name: knowledge_base
        embedding-dimension: 1024
        metric-type: COSINE
        index-type: HNSW

  elasticsearch:
    uris: ${ELASTICSEARCH_URIS:http://elasticsearch-svc.ai-rag.svc:9200}

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: springboot-ai-service
  tracing:
    sampling:
      probability: 1.0
  otlp:
    metrics:
      export:
        url: http://otel-collector.observability.svc:4318/v1/metrics
    tracing:
      endpoint: http://otel-collector.observability.svc:4318/v1/traces

可觀測性：Prometheus + Grafana + OpenTelemetry

監控架構

┌──────────────────────────────────────────────────────────────────┐
│                    全棧可觀測性架構                                 │
│                                                                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ vLLM     │  │ Milvus   │  │ES        │  │Spring    │        │
│  │ /metrics │  │ /metrics │  │/_prom    │  │/actuator │        │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│       │              │              │              │              │
│       └──────────────┴──────────────┴──────────────┘              │
│                            │                                      │
│                   ┌────────▼────────┐                             │
│                   │  OTel Collector │                             │
│                   │  (接收+處理+匯出)│                             │
│                   └────────┬────────┘                             │
│              ┌─────────────┼─────────────┐                        │
│              ▼             ▼             ▼                        │
│     ┌──────────────┐ ┌──────────┐ ┌──────────┐                  │
│     │  Prometheus   │ │  Jaeger  │ │  Loki    │                  │
│     │  (指標儲存)   │ │ (鏈路追蹤)│ │ (日誌)   │                  │
│     └──────┬───────┘ └──────────┘ └──────────┘                  │
│            ▼                                                      │
│     ┌──────────────┐                                              │
│     │   Grafana    │                                              │
│     │  (統一面板)  │                                              │
│     └──────────────┘                                              │
└──────────────────────────────────────────────────────────────────┘

OpenTelemetry Collector部署

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          ports:
            - containerPort: 4318
            - containerPort: 4317
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
      prometheus:
        config:
          scrape_configs:
            - job_name: "vllm"
              scrape_interval: 15s
              static_configs:
                - targets: ["vllm-qwen7b-svc.ai-inference.svc:8000"]
            - job_name: "milvus"
              scrape_interval: 15s
              static_configs:
                - targets: ["milvus-svc.ai-rag.svc:9091"]

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400

    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
      otlphttp:
        endpoint: http://jaeger-collector.observability.svc:4318
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push

    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlphttp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]

Prometheus告警規則

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: observability
spec:
  groups:
    - name: vllm-alerts
      rules:
        - alert: VLLMHighQueueDepth
          expr: vllm_num_requests_waiting > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM請求佇列深度過高"
            description: "實例 {{ $labels.instance }} 等待佇列 {{ $value }} 個請求"

        - alert: VLLMHighLatency
          expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 10
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "vLLM P99延遲過高"
            description: "實例 {{ $labels.instance }} P99延遲 {{ $value }}s"

        - alert: VLLMGPUCacheFull
          expr: vllm_gpu_cache_usage_perc > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM GPU快取使用率過高"
            description: "實例 {{ $labels.instance }} 快取使用率 {{ $value }}%"

        - alert: VLLMPreemptions
          expr: increase(vllm_num_preemptions[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "vLLM發生請求搶佔"
            description: "實例 {{ $labels.instance }} 5分鐘內搶佔 {{ $value }} 次"

    - name: milvus-alerts
      rules:
        - alert: MilvusHighQueryLatency
          expr: histogram_quantile(0.99, rate(milvus_proxy_sq_latency_bucket[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Milvus查詢延遲過高"

        - alert: MilvusInsertRateLow
          expr: rate(milvus_proxy_insert_vectors_count[5m]) < 10
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Milvus插入速率異常低"

Grafana Dashboard關鍵面板

面板	指標	視覺化型別
推理QPS	`rate(vllm_num_requests_total[1m])`	折線圖
P50/P90/P99延遲	`histogram_quantile`	折線圖
GPU快取使用率	`vllm_gpu_cache_usage_perc`	儀表盤
請求佇列深度	`vllm_num_requests_waiting`	柱狀圖
向量檢索延遲	`milvus_proxy_sq_latency`	折線圖
Spring Boot JVM	`jvm_memory_used_bytes`	面積圖
HTTP錯誤率	`rate(http_server_requests_seconds_count{status=~"5.."}[5m])`	折線圖
鏈路追蹤	Jaeger整合	表格+連結

成本最佳化：量化 + MIG復用 + 動態擴縮容 + Spot實例

模型量化對比

量化方法	精度損失	體積壓縮比	推理加速	相容性
FP16 (基線)	0%	1x	1x	全部
INT8 (PTQ)	<1%	2x	1.5x	A100/H100
INT4 (GPTQ)	1-3%	4x	1.8x	需校準資料
INT4 (AWQ)	1-2%	4x	2x	需校準資料
INT4 (GGUF)	2-5%	4x	1.5x	CPU/GPU通用
FP8	<0.5%	2x	2x+	H100原生

成本最佳化策略全景

┌──────────────────────────────────────────────────────────────────┐
│                    成本最佳化四板斧                                 │
│                                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────┐ │
│  │ 1.模型量化   │  │ 2.MIG復用    │  │ 3.動態擴縮容 │  │4.Spot│ │
│  │              │  │              │  │              │  │實例  │ │
│  │ FP16→INT4   │  │ 1GPU→7MIG   │  │ HPA+VPA     │  │競價  │ │
│  │ 體積↓75%    │  │ 利用率↑7x   │  │ 閒時縮容    │  │成本  │ │
│  │ 推理↑2x     │  │ 成本↓85%    │  │ 成本↓40%    │  │↓70%  │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────┘ │
│                                                                    │
│  綜合效果：成本降至原來的 5%-15%                                    │
└──────────────────────────────────────────────────────────────────┘

成本計算實例

方案	GPU配置	月成本	說明
基線：FP16獨佔	4 × A100 80GB	¥60,000	2副本 × 2GPU
INT4量化	2 × A100 80GB	¥30,000	2副本 × 1GPU
INT4 + MIG	1 × A100 80GB	¥15,000	1副本 × 1MIG實例
INT4 + MIG + HPA	1 × A100 80GB(閒時0.5)	¥10,000	閒時縮到1副本
INT4 + MIG + HPA + Spot	1 × A100 Spot	¥3,000	Spot折扣70%

Spot實例使用策略

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b-spot
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b-spot
  template:
    metadata:
      labels:
        app: vllm-qwen7b-spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: cloud.google.com/gke-preemptible
                    operator: In
                    values: ["true"]
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          env:
            - name: GRACEFUL_SHUTDOWN
              value: "true"
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]

量化模型載入配置

from vllm import LLM

llm_awq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    enforce_eager=True,
)

llm_gptq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    quantization="gptq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

GitOps：ArgoCD + Kustomize自動化交付

GitOps工作流

┌──────────────────────────────────────────────────────────────────┐
│                    GitOps 交付流水線                                │
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ 開發提交 │───→│ CI建構   │───→│ 映像檔推送│───→│ Git倉庫  │  │
│  │ 程式碼變更│    │ 測試通過 │    │ Registry │    │ 清單更新 │  │
│  └──────────┘    └──────────┘    └──────────┘    └────┬─────┘  │
│                                                       │          │
│                                              ┌────────▼───────┐  │
│                                              │   ArgoCD       │  │
│                                              │  偵測到變更    │  │
│                                              │  自動Sync      │  │
│                                              └────────┬───────┘  │
│                                                       │          │
│                                    ┌──────────────────┼──────┐   │
│                                    ▼                  ▼      ▼   │
│                              ┌──────────┐   ┌──────────┐  ┌──┐ │
│                              │  Dev     │   │  Staging  │  │PR││
│                              │  環境    │   │  環境     │  │OD│ │
│                              └──────────┘   └──────────┘  └──┘ │
└──────────────────────────────────────────────────────────────────┘

Kustomize目錄結構

k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── ingress.yaml
├── overlays/
│   ├── dev/
│   │   ├── kustomization.yaml
│   │   └── patch-replicas.yaml
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patch-resources.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── patch-replicas.yaml
│       ├── patch-resources.yaml
│       └── patch-hpa.yaml

Base Kustomization

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml

commonLabels:
  app.kubernetes.io/part-of: ai-inference-platform
  app.kubernetes.io/managed-by: kustomize

images:
  - name: vllm-server
    newName: registry.example.com/vllm-openai
    newTag: v0.8.0

configMapGenerator:
  - name: vllm-config
    literals:
      - MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-AWQ
      - MAX_MODEL_LEN=8192
      - GPU_MEMORY_UTILIZATION=0.9

secretGenerator:
  - name: ai-api-keys
    literals:
      - openai-key=placeholder

Production Overlay

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ai-inference-production

resources:
  - ../../base

patches:
  - target:
      kind: Deployment
      name: vllm-qwen7b
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-qwen7b
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: vllm
                resources:
                  limits:
                    nvidia.com/gpu: 2
                  requests:
                    nvidia.com/gpu: 2
                    cpu: "4"
                    memory: 16Gi

  - target:
      kind: HorizontalPodAutoscaler
      name: vllm-qwen7b-hpa
    patch: |
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: vllm-qwen7b-hpa
      spec:
        minReplicas: 4
        maxReplicas: 16

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-inference-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: ai-platform-deploys
spec:
  project: ai-platform
  source:
    repoURL: https://git.example.com/ai-platform/k8s-manifests.git
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m

ArgoCD App of Apps模式

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ai-platform/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ai-platform
  namespace: argocd
spec:
  description: AI推理平台專案
  sourceRepos:
    - https://git.example.com/ai-platform/*
  destinations:
    - namespace: ai-inference-*
      server: https://kubernetes.default.svc
    - namespace: ai-rag-*
      server: https://kubernetes.default.svc
    - namespace: ai-app-*
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota

總結：雲原生AI部署檢查清單

階段	檢查項	狀態
模型服務	vLLM/Triton選型確定	☐
模型服務	量化策略(INT4/AWQ)確定	☐
Docker化	多階段Dockerfile編寫	☐
Docker化	Docker Compose本地驗證	☐
K8s GPU	GPU Operator安裝	☐
K8s GPU	MIG/時間分片配置	☐
推理部署	vLLM Deployment部署	☐
推理部署	HPA自動擴縮容配置	☐
RAG服務	Milvus集群部署	☐
RAG服務	Elasticsearch部署	☐
微服務	Spring Boot AI上雲	☐
微服務	Ingress路由配置	☐
可觀測性	Prometheus+Grafana	☐
可觀測性	OpenTelemetry鏈路追蹤	☐
可觀測性	告警規則配置	☐
成本最佳化	模型量化驗證	☐
成本最佳化	Spot實例策略	☐
GitOps	ArgoCD + Kustomize	☐
GitOps	多環境Overlay配置	☐

雲原生AI部署不是終點，而是起點。 從Docker容器化到K8s編排，從GPU調度到可觀測性，每一步都是工程化的積累。記住：沒有監控的部署就是盲飛，沒有自動化的運維就是手工活。 用GitOps把一切自動化，讓AI推理服務真正成為雲原生的一等公民。