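K8s Sidecar AI Inference Proxy: 7 Production Patterns from Traffic Interception to Model Routing
Is your AI inference service still running without a proxy layer? Does every model upgrade force a change to business code? Does traffic management fall apart when several model versions coexist? Is GPU utilization below 30% while you keep adding capacity? By 2026 the Kubernetes sidecar pattern has become a standard architecture for AI inference proxies, and a sidecar container is a practical way to decouple inference logic from business logic.
Key takeaways
- Understand the architecture of a K8s sidecar AI inference proxy and where it fits
- Work through complete implementations of 7 production-grade sidecar proxy patterns
- Learn the YAML configuration for traffic interception, model routing, and A/B testing
- Get to know advanced optimizations such as GPU resource pooling and request batching
- Avoid 5 common pitfalls and 10 frequently seen errors
Contents
- Sidecar AI proxy architecture overview
- Pattern 1: Envoy traffic interception and rewriting
- Pattern 2: Smart model routing
- Pattern 3: A/B testing and canary releases
- Pattern 4: Multi-model serving and version management
- Pattern 5: GPU resource pooling
- Pattern 6: Batching and request merging
- Pattern 7: Observability and distributed tracing
- 5 common pitfalls and fixes
- Troubleshooting 10 common errors
- Advanced optimization techniques
- Comparison: sidecar proxy vs Service Mesh vs Gateway
- Online tools
Sidecar AI proxy architecture overview
Why does AI inference need a sidecar proxy?
Traditional AI inference deployments couple model loading, inference execution, and traffic management inside the business container. Whenever a model version changes, a routing policy is updated, or resource limits are adjusted, the whole service has to be rebuilt and redeployed. The sidecar proxy pattern separates these concerns: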
┌─────────────────────────────────────────────────┐
│                       Pod                       │
│                                                 │
│  ┌──────────────┐      ┌──────────────────────┐ │
│  │  Business    │      │  Sidecar Proxy       │ │
│  │  Container   │─────▶│  (AI Inference)      │ │
│  │              │      │                      │ │
│  │ - API logic  │      │ - Traffic intercept  │ │
│  │ - Processing │      │ - Model routing      │ │
│  │ - Aggregation│      │ - Load balancing     │ │
│  │              │      │ - Request batching   │ │
│  │  :8080       │      │ - Metrics collection │ │
│  └──────────────┘      │                      │ │
│         │              │  :15001 (outbound)   │ │
│         │              │  :15006 (inbound)    │ │
│         ▼              └──────────────────────┘ │
│  ┌──────────────┐                 │             │
│  │  Model       │◀────────────────┘             │
│  │  Server      │                               │
│  │ (vLLM/Triton │                               │
│  │  /Ollama)    │                               │
│  │  :8000       │                               │
│  └──────────────┘                               │
└─────────────────────────────────────────────────┘
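Core responsibilities of the sidecar proxy
| Responsibility | Description | Benefit |
|---|---|---|
| Traffic interception | Intercepts inference requests from the business container | No changes to business code |
| Model routing | Routes to different models based on request features | Multiple model versions coexist |
| Load balancing | Distributes inference requests across replicas | Higher throughput |
| Batching | Merges multiple requests into one batched inference call | 3-5x higher GPU utilization |
| Circuit breaking and fallback | Degrades quickly when the model service misbehaves | Protects system stability |
| Observability | Collects inference latency, throughput, and other metrics | End-to-end observability |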
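Pattern 1: Envoy traffic interception and rewriting
How it works
Envoy runs as the sidecar proxy. iptables rules redirect the business container's outbound traffic to Envoy, which rewrites inference requests and forwards them to the target model service. This is the classic way to intercept traffic in the K8s sidecar pattern.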
Client Request
      │
      ▼
┌───────────┐   iptables    ┌──────────────┐
│ Business  │───redirect───▶│    Envoy     │
│ Container │               │   Sidecar    │
│   :8080   │               │    :15001    │
└───────────┘               └──────┬───────┘
                                   │
                     ┌─────────────┼─────────────┐
                     ▼             ▼             ▼
                ┌─────────┐   ┌─────────┐   ┌─────────┐
                │ Model A │   │ Model B │   │ Model C │
                │  :8000  │   │  :8001  │   │  :8002  │
                └─────────┘   └─────────┘   └─────────┘
Full configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.0/8"
        traffic.sidecar.istio.io/excludeInboundPorts: "9090"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: INFERENCE_ENDPOINT
              value: "http://localhost:15001/v1/completions"
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
        - name: model-server
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
          # The vLLM OpenAI server takes the model and memory settings as CLI flags,
          # so they are passed as args rather than environment variables.
          args:
            - "--model"
            - "Qwen/Qwen2.5-72B-Instruct"
            - "--gpu-memory-utilization"
            - "0.9"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
Envoy rewrite rules
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-rewrite
  namespace: ai-serving
spec:
  hosts:
    - inference-internal
  http:
    # URI and header sit in the same match block so both must hold (AND).
    # Separate list items would be OR-ed and rewrite every /v1/completions call.
    - match:
        - uri:
            prefix: "/v1/completions"
          headers:
            x-model-type:
              exact: "chat"
      rewrite:
        uri: "/v1/chat/completions"
      route:
        - destination:
            host: model-server
            port:
              number: 8000
    - match:
        - uri:
            prefix: "/v1/embeddings"
      route:
        - destination:
            host: embedding-server
            port:
              number: 8001
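Pattern 2: Smart model routing
Dynamic routing based on request features
In AI inference, different requests may need models of different sizes. The sidecar proxy can route on request headers, payload content, token count, and similar features.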
┌──────────────────────────────────────────────────────┐
│ Smart Model Router │
│ │
│ Request ──▶ [Token Counter] ──▶ [Model Selector] │
│ │ │ │
│ │ ┌──────────────┼──────────┐ │
│ │ ▼ ▼ ▼ │
│ │ ┌───────┐ ┌─────────┐ ┌──────┐ │
│ │ │Small │ │Medium │ │Large │ │
│ │ │Model │ │Model │ │Model │ │
│ │ │<1K tok│ │1K-8K tok│ │>8K │ │
│ │ │Qwen2.5│ │Qwen2.5 │ │Qwen2.│ │
│ │ │-7B │ │-32B │ │5-72B │ │
│ │ └───────┘ └─────────┘ └──────┘ │
└──────────────────────────────────────────────────────┘
Routing configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: smart-model-route
  namespace: ai-serving
spec:
  hosts:
    - ai-router
  http:
    - match:
        - headers:
            x-token-range:
              exact: "small"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-7b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "medium"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "large"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-72b-service
            port:
              number: 8000
          weight: 100
    # Fallback when no x-token-range header is present
    - route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
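The routes above match on an `x-token-range` header, so something upstream has to compute it. A minimal sketch of how a client or proxy might derive that header, assuming the 1K and 8K thresholds from the diagram and a rough 4-characters-per-token heuristic (not a real tokenizer, and noticeably off for CJK text). The function names are illustrative and not part of any library:

```python
from typing import Dict, List

SMALL_MAX_TOKENS = 1000   # assumed threshold, mirrors the "<1K tok" tier
MEDIUM_MAX_TOKENS = 8000  # assumed threshold, mirrors the "1K-8K tok" tier


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Rough estimate: about 4 characters per token (heuristic only)."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4


def token_range(token_count: int) -> str:
    """Map an estimated token count to the header value the routes match on."""
    if token_count <= SMALL_MAX_TOKENS:
        return "small"
    if token_count <= MEDIUM_MAX_TOKENS:
        return "medium"
    return "large"


def routing_headers(messages: List[Dict[str, str]]) -> Dict[str, str]:
    """Build the header that selects the small, medium, or large route."""
    return {"x-token-range": token_range(estimate_tokens(messages))}
```

Requests that arrive without the header fall through to the default route, so a missing or failed estimate degrades to the medium model rather than to an error.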
Go routing proxy implementation
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.uber.org/zap"
)

// ModelRouteConfig holds one endpoint per model tier and the token thresholds between tiers.
type ModelRouteConfig struct {
	SmallModelEndpoint  string `json:"smallModelEndpoint"`
	MediumModelEndpoint string `json:"mediumModelEndpoint"`
	LargeModelEndpoint  string `json:"largeModelEndpoint"`
	SmallTokenThreshold int    `json:"smallTokenThreshold"`
	LargeTokenThreshold int    `json:"largeTokenThreshold"`
}

// InferenceRequest covers the fields of an OpenAI-style request that routing needs.
type InferenceRequest struct {
	Model     string    `json:"model"`
	Messages  []Message `json:"messages,omitempty"`
	Prompt    string    `json:"prompt,omitempty"`
	MaxTokens int       `json:"max_tokens,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type SmartRouter struct {
	config  *ModelRouteConfig
	logger  *zap.Logger
	metrics *RouterMetrics
}

type RouterMetrics struct {
	routeDecision  *prometheus.CounterVec
	requestLatency *prometheus.HistogramVec
}

func NewSmartRouter(cfg *ModelRouteConfig, logger *zap.Logger) *SmartRouter {
	metrics := &RouterMetrics{
		routeDecision: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "ai_router_route_decision_total",
				Help: "Model route decision counter",
			},
			[]string{"model_size", "model_name"},
		),
		requestLatency: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "ai_router_request_latency_seconds",
				Help:    "Request routing latency",
				Buckets: prometheus.DefBuckets,
			},
			[]string{"model_size"},
		),
	}
	prometheus.MustRegister(metrics.routeDecision, metrics.requestLatency)
	return &SmartRouter{config: cfg, logger: logger, metrics: metrics}
}

// estimateTokenCount is a rough heuristic: len() counts bytes, and about
// 4 bytes per token is assumed. It is not a tokenizer.
func (r *SmartRouter) estimateTokenCount(req *InferenceRequest) int {
	totalChars := 0
	if req.Prompt != "" {
		totalChars += len(req.Prompt)
	}
	for _, msg := range req.Messages {
		totalChars += len(msg.Content)
	}
	return totalChars / 4
}

// selectModel returns the tier name and endpoint for an estimated token count.
func (r *SmartRouter) selectModel(tokenCount int) (string, string) {
	if tokenCount <= r.config.SmallTokenThreshold {
		return "small", r.config.SmallModelEndpoint
	}
	if tokenCount <= r.config.LargeTokenThreshold {
		return "medium", r.config.MediumModelEndpoint
	}
	return "large", r.config.LargeModelEndpoint
}

func (r *SmartRouter) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	defer req.Body.Close()
	body, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusBadRequest)
		return
	}
	var inferReq InferenceRequest
	if err := json.Unmarshal(body, &inferReq); err != nil {
		http.Error(w, "invalid request format", http.StatusBadRequest)
		return
	}
	tokenCount := r.estimateTokenCount(&inferReq)
	modelSize, endpoint := r.selectModel(tokenCount)
	r.logger.Info("routing decision", zap.Int("token_count", tokenCount), zap.String("model_size", modelSize))
	r.metrics.routeDecision.WithLabelValues(modelSize, inferReq.Model).Inc()

	target, err := url.Parse(endpoint)
	if err != nil {
		r.logger.Error("invalid model endpoint", zap.String("endpoint", endpoint), zap.Error(err))
		http.Error(w, "bad upstream configuration", http.StatusInternalServerError)
		return
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	// Restore the body that was consumed for token estimation.
	req.Body = io.NopCloser(bytes.NewReader(body))
	req.ContentLength = int64(len(body))
	proxy.ServeHTTP(w, req)
	r.metrics.requestLatency.WithLabelValues(modelSize).Observe(time.Since(start).Seconds())
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	cfg := &ModelRouteConfig{
		SmallModelEndpoint:  "http://qwen-7b-service:8000",
		MediumModelEndpoint: "http://qwen-32b-service:8000",
		LargeModelEndpoint:  "http://qwen-72b-service:8000",
		SmallTokenThreshold: 1000,
		LargeTokenThreshold: 8000,
	}
	router := NewSmartRouter(cfg, logger)
	mux := http.NewServeMux()
	mux.Handle("/v1/", router)
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "ok") })
	server := &http.Server{Addr: ":15001", Handler: mux, WriteTimeout: 120 * time.Second}
	logger.Info("smart router starting", zap.String("addr", server.Addr))
	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		logger.Fatal("server exited", zap.Error(err))
	}
}
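Pattern 3: A/B testing and canary releases
Weight-based canary release
A canary release is an essential step when a new AI model goes live. The sidecar proxy can split traffic by weight and shift it gradually from the old model to the new one.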
┌────────────────────────────────────────────┐
│ Canary Deployment Flow │
│ │
│ Traffic ──▶ [Sidecar Proxy] ──┬── 90% ──▶│──▶ Model v1 (Stable)
│ │ │
│ └── 10% ──▶│──▶ Model v2 (Canary)
│ │
│ Metrics: │
│ ┌──────────────────────────────────────┐ │
│ │ v1: latency_p99=120ms error=0.1% │ │
│ │ v2: latency_p99=95ms error=0.05% │ │
│ └──────────────────────────────────────┘ │
│ │
│ Decision: Promote v2 ──▶ Shift to 50/50 │
└────────────────────────────────────────────┘
Canary release configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-canary
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    # Named so that tools such as Argo Rollouts can reference this route
    - name: primary
      route:
        - destination:
            host: model-v1
            port:
              number: 8000
          weight: 90
        - destination:
            host: model-v2
            port:
              number: 8000
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 30s
        retryOn: 5xx,reset
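To make the 90/10 split concrete, here is a minimal sketch of weight-based selection. It hashes a request key into a bucket so the same key always lands on the same backend, which is an assumption made here for repeatable experiments; a proxy may instead pick per request at random. The function name and the request-key scheme are illustrative only:

```python
import hashlib
from typing import List, Tuple


def pick_backend(request_key: str, weighted: List[Tuple[str, int]]) -> str:
    """Pick a backend by weight; the same key always maps to the same backend."""
    total = sum(weight for _, weight in weighted)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the key into a stable bucket in [0, total).
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for name, weight in weighted:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    routes = [("model-v1", 90), ("model-v2", 10)]
    counts = {"model-v1": 0, "model-v2": 0}
    for i in range(10_000):
        counts[pick_backend(f"req-{i}", routes)] += 1
    print(counts)  # roughly a 90/10 split
```

Shifting the canary to 25% or 50% only changes the weights; keys that already mapped to the canary keep doing so because the canary's bucket range grows from the top end.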
Header-based A/B testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-ab-test
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-creative"
      route:
        - destination:
            host: model-v2-creative
            port:
              number: 8000
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-precise"
      route:
        - destination:
            host: model-v2-precise
            port:
              number: 8000
    - route:
        - destination:
            host: model-v1
            port:
              number: 8000
Automated canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-rollout
  namespace: ai-serving
spec:
  replicas: 4
  # selector and Pod template omitted for brevity
  strategy:
    canary:
      canaryService: model-v2
      stableService: model-v1
      trafficRouting:
        istio:
          virtualServices:
            - name: model-canary
              routes:
                - primary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 75
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: model-quality-check
        startingStep: 2
        args:
          - name: canary-service
            value: model-v2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
  namespace: ai-serving
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          # code=~"5.." matches any 5xx status; the literal "5xx" would match nothing
          query: |
            sum(rate(http_requests_total{service="{{args.canary-service}}",code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.canary-service}}"}[1m]))
    - name: latency-p99
      interval: 30s
      count: 10
      # Threshold in milliseconds (the query multiplies seconds by 1000)
      successCondition: result[0] <= 500
      provider:
        prometheus:
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.canary-service}}"}[1m]))
              by (le)
            ) * 1000
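Pattern 4: Multi-model serving and version management
Architecture for coexisting model versions
In production, different business lines may depend on different model versions. The sidecar proxy can manage several model versions at once, so versions coexist and migrations stay smooth.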
┌─────────────────────────────────────────────────────┐
│ Multi-Model Serving Pod │
│ │
│ ┌──────────┐ ┌────────────────────────────┐ │
│ │ Business │ │ Model Router Sidecar │ │
│ │ App │────▶│ │ │
│ │ │ │ /v1/chat ──▶ v2.5 model │ │
│ │ │ │ /v1/embed ──▶ embed model │ │
│ │ │ │ /v1/rerank─▶ rerank model │ │
│ └──────────┘ └────────────┬───────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌────────┐│
│ │ vLLM │ │ TEI │ │ TEI ││
│ │ Chat │ │ Embed │ │ Rerank ││
│ │ :8000 │ │ :8080 │ │ :8081 ││
│ └──────────┘ └──────────┘ └────────┘│
└─────────────────────────────────────────────────────┘
Multi-model Deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-serving
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-model
  template:
    metadata:
      labels:
        app: multi-model
    spec:
      containers:
        - name: chat-model
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
          # vLLM reads the model and port from CLI flags, not environment variables
          args:
            - "--model"
            - "Qwen/Qwen2.5-32B-Instruct"
            - "--port"
            - "8000"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
        - name: embedding-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-large-zh-v1.5"
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
        - name: rerank-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8081
          env:
            - name: MODEL_ID
              value: "BAAI/bge-reranker-v2-m3"
            - name: PORT
              value: "8081"
            - name: RERANK
              value: "true"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Model version management CRD
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-5
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.5"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.5-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 80
    canary: false
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
---
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-6-canary
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.6"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.6-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 20
    canary: true
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
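Pattern 5: GPU resource pooling
GPU sharing and time-slicing
GPUs are the most expensive resource in AI inference. The sidecar proxy can pool GPU resources so that several inference services share one GPU, using time-slicing to raise utilization.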
┌─────────────────────────────────────────────────────┐
│ GPU Resource Pooling │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ Sidecar │ │ Sidecar │ │ Sidecar │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ GPU Scheduler Sidecar │ │
│ │ │ │
│ │ Time-Slicing: │ │
│ │ GPU 0: [A][A][B][B][C][A][A][B][B][C] │ │
│ │ GPU 1: [C][C][A][B][C][C][A][B][C][C] │ │
│ │ │ │
│ │ Memory Partitioning: │ │
│ │ GPU 0: 40% A | 35% B | 25% C │ │
│ │ GPU 1: 30% A | 40% B | 30% C │ │
│ └──────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │
│ │ A100 80G │ │ A100 80G │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────┘
GPU time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: ai-serving
data:
  scheduler.yaml: |
    scheduling:
      strategy: time-slicing
      gpuGroups:
        - name: inference-pool
          gpuIds: [0, 1, 2, 3]
          timeSliceInterval: 100ms
          maxSharesPerGpu: 4
          memoryLimitPerShare: 20Gi
        - name: embedding-pool
          gpuIds: [4, 5]
          timeSliceInterval: 50ms
          maxSharesPerGpu: 8
          memoryLimitPerShare: 10Gi
      policies:
        - modelType: chat
          gpuGroup: inference-pool
          minShares: 1
          maxShares: 2
          priority: high
        - modelType: embedding
          gpuGroup: embedding-pool
          minShares: 1
          maxShares: 4
          priority: medium
GPU resource quota management
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    # For extended resources, ResourceQuota only accepts the requests. prefix,
    # so the limits.* entries are dropped; requests and limits must be equal anyway.
    requests.nvidia.com/gpu: "8"
    requests.nvidia.com/gpu-share: "32"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100000
globalDefault: false
description: "Low priority for batch inference jobs"
Python GPU scheduler
import asyncio
import time
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

logger = logging.getLogger(__name__)


class TaskPriority(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class InferenceTask:
    taskId: str
    modelId: str
    gpuMemoryRequired: int  # MB
    priority: TaskPriority
    maxLatencyMs: int
    submittedAt: float = field(default_factory=time.time)
    assignedGpu: Optional[int] = None


@dataclass
class GpuSlot:
    gpuId: int
    totalMemory: int  # MB
    usedMemory: int = 0
    currentModel: Optional[str] = None
    lastUsed: float = field(default_factory=time.time)

    @property
    def availableMemory(self) -> int:
        return self.totalMemory - self.usedMemory

    @property
    def utilization(self) -> float:
        return self.usedMemory / self.totalMemory if self.totalMemory > 0 else 0.0


class GpuScheduler:
    def __init__(self, gpuSlots: List[GpuSlot], maxQueueSize: int = 1000):
        self.gpuSlots = {slot.gpuId: slot for slot in gpuSlots}
        self.taskQueue: List[InferenceTask] = []
        self.maxQueueSize = maxQueueSize
        self._lock = asyncio.Lock()
        self._stats = {"totalScheduled": 0, "totalRejected": 0, "totalEvicted": 0}

    async def submitTask(self, task: InferenceTask) -> Optional[int]:
        """Return the assigned GPU id, or None if the task was queued or rejected."""
        async with self._lock:
            if len(self.taskQueue) >= self.maxQueueSize:
                self._stats["totalRejected"] += 1
                return None
            gpuId = self._findBestGpu(task)
            if gpuId is not None:
                self._assign(task, gpuId)
                return gpuId
            self.taskQueue.append(task)
            # Stable sort keeps FIFO order within one priority level.
            self.taskQueue.sort(key=lambda t: t.priority.value)
            return None

    def _assign(self, task: InferenceTask, gpuId: int) -> None:
        """Book the task's memory on the slot and record the assignment."""
        slot = self.gpuSlots[gpuId]
        slot.usedMemory += task.gpuMemoryRequired
        slot.currentModel = task.modelId
        slot.lastUsed = time.time()
        task.assignedGpu = gpuId
        self._stats["totalScheduled"] += 1

    def _findBestGpu(self, task: InferenceTask) -> Optional[int]:
        candidates = [(gid, s) for gid, s in self.gpuSlots.items()
                      if s.availableMemory >= task.gpuMemoryRequired]
        if not candidates:
            return self._tryEvict(task)
        # Prefer the least utilized GPU.
        candidates.sort(key=lambda x: x[1].utilization)
        return candidates[0][0]

    def _tryEvict(self, task: InferenceTask) -> Optional[int]:
        """Only high-priority tasks may evict a model idle for more than 5 minutes."""
        if task.priority != TaskPriority.HIGH:
            return None
        for gpuId, slot in self.gpuSlots.items():
            idle = slot.currentModel and slot.lastUsed < time.time() - 300
            # Evicting only helps if the task fits on the emptied GPU.
            if idle and slot.totalMemory >= task.gpuMemoryRequired:
                slot.usedMemory = 0
                slot.currentModel = None
                self._stats["totalEvicted"] += 1
                return gpuId
        return None

    async def releaseGpu(self, gpuId: int, memoryFreed: int):
        async with self._lock:
            slot = self.gpuSlots.get(gpuId)
            if not slot:
                return
            slot.usedMemory = max(0, slot.usedMemory - memoryFreed)
            if slot.usedMemory == 0:
                slot.currentModel = None
            # Drain queued tasks from the head for as long as they fit.
            while self.taskQueue and slot.availableMemory >= self.taskQueue[0].gpuMemoryRequired:
                self._assign(self.taskQueue.pop(0), gpuId)

    def getStats(self) -> Dict:
        return {
            **self._stats,
            "queueSize": len(self.taskQueue),
            "gpuUtilization": {
                gid: {
                    "utilization": f"{s.utilization:.1%}",
                    "availableMemory": f"{s.availableMemory}MB",
                    "currentModel": s.currentModel,
                }
                for gid, s in self.gpuSlots.items()
            },
        }
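Pattern 6: Batching and request merging
Dynamic batching architecture
GPU utilization for LLM inference is often low (10-30%) when each request is processed on its own. The sidecar proxy can collect the requests that arrive within a short time window and merge them into a single batched inference call, which raises throughput substantially.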
┌─────────────────────────────────────────────────────┐
│ Dynamic Batching Sidecar │
│ │
│ Request 1 ──▶ ┐ │
│ Request 2 ──▶ │ ┌──────────────────────────┐ │
│ Request 3 ──▶ ├─▶│ Batch Window (50ms) │ │
│ Request 4 ──▶ │ │ │ │
│ Request 5 ──▶ ┘ │ Collect → Merge → Send │ │
│ │ │ │
│ │ Batch Size: 4-32 │ │
│ │ Max Wait: 50ms │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Model Server │ │
│ │ (vLLM with │ │
│ │ continuous │ │
│ │ batching) │ │
│ └──────────────────────┘ │
│ │
│ Throughput: 1x → 5-8x │
│ Latency overhead: +5-15ms │
└─────────────────────────────────────────────────────┘
Batching proxy configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: batch-proxy-config
  namespace: ai-serving
data:
  proxy.yaml: |
    server:
      port: 15001
      readTimeout: 30s
      writeTimeout: 120s
    batching:
      enabled: true
      maxBatchSize: 32
      maxWaitTimeMs: 50
      maxRequestTokens: 8192
      strategy: dynamic
    routing:
      defaultEndpoint: http://localhost:8000
      endpoints:
        - path: /v1/chat/completions
          model: chat
          batchEnabled: true
        - path: /v1/embeddings
          model: embedding
          batchEnabled: true
          maxBatchSize: 64
    circuitBreaker:
      enabled: true
      failureThreshold: 5
      recoveryTimeout: 30s
      halfOpenRequests: 3
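The `circuitBreaker` settings above describe a three-state breaker. A minimal sketch of that state machine follows, assuming `halfOpenRequests` means "this many probe requests are let through after the recovery timeout, and all of them must succeed before the breaker closes". That reading, the class name, and the injectable clock are assumptions made for illustration:

```python
import time
from typing import Callable


class CircuitBreaker:
    """closed -> open after N consecutive failures -> half_open after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 half_open_requests: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._trials_started = 0
        self._trial_successes = 0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                return False
            # Recovery window elapsed: let a limited number of probes through.
            self.state = "half_open"
            self._trials_started = 0
            self._trial_successes = 0
        if self.state == "half_open":
            if self._trials_started >= self.half_open_requests:
                return False
            self._trials_started += 1
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._trial_successes += 1
            if self._trial_successes >= self.half_open_requests:
                self.state = "closed"
                self._failures = 0
        elif self.state == "closed":
            self._failures = 0

    def record_failure(self) -> None:
        if self.state == "open":
            return  # late result from before the trip; nothing to update
        if self.state == "half_open":
            self._trip()  # any failed probe reopens the breaker
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

A proxy would call `allow_request()` before forwarding and return a fallback response when it is `False`, then report the outcome with `record_success()` or `record_failure()`.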
Python batching proxy
import asyncio
import time
import uuid
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Any, Optional
from collections import defaultdict

logger = logging.getLogger(__name__)


@dataclass
class BatchRequest:
    requestId: str
    payload: Dict[str, Any]
    future: asyncio.Future
    submittedAt: float = field(default_factory=time.time)


@dataclass
class BatchConfig:
    maxBatchSize: int = 32
    maxWaitTimeMs: int = 50
    maxRequestTokens: int = 8192


class DynamicBatcher:
    def __init__(self, config: BatchConfig, inferenceFn):
        self.config = config
        # inferenceFn(model, batchPayload) must return one result per request, in order.
        self.inferenceFn = inferenceFn
        self.pendingRequests: Dict[str, List[BatchRequest]] = defaultdict(list)
        self._running = False
        self._tasks: set = set()
        self._stats = {"totalBatches": 0, "totalRequests": 0, "avgBatchSize": 0.0, "avgWaitTimeMs": 0.0}

    def _spawn(self, coro) -> None:
        # Keep a reference so the task is not garbage-collected before it finishes.
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def start(self):
        self._running = True
        self._spawn(self._batchLoop())

    async def stop(self):
        self._running = False

    async def submit(self, model: str, payload: Dict[str, Any]) -> Any:
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(requestId=str(uuid.uuid4()), payload=payload, future=future)
        self.pendingRequests[model].append(request)
        self._stats["totalRequests"] += 1
        # Flush immediately once a full batch has accumulated.
        if len(self.pendingRequests[model]) >= self.config.maxBatchSize:
            self._spawn(self._processBatch(model))
        return await future

    async def _batchLoop(self):
        # Flush any batch whose oldest request has waited longer than maxWaitTimeMs.
        while self._running:
            for model in list(self.pendingRequests.keys()):
                if self.pendingRequests[model]:
                    oldest = self.pendingRequests[model][0]
                    waitTime = (time.time() - oldest.submittedAt) * 1000
                    if waitTime >= self.config.maxWaitTimeMs:
                        await self._processBatch(model)
            await asyncio.sleep(0.005)

    async def _processBatch(self, model: str):
        requests = self.pendingRequests[model][:self.config.maxBatchSize]
        self.pendingRequests[model] = self.pendingRequests[model][len(requests):]
        if not requests:
            return
        batchSize = len(requests)
        self._stats["totalBatches"] += 1
        waitTimes = [(time.time() - r.submittedAt) * 1000 for r in requests]
        avgWait = sum(waitTimes) / len(waitTimes)
        # Exponential moving averages with a 0.05 smoothing factor.
        self._stats["avgWaitTimeMs"] = self._stats["avgWaitTimeMs"] * 0.95 + avgWait * 0.05
        self._stats["avgBatchSize"] = self._stats["avgBatchSize"] * 0.95 + batchSize * 0.05
        try:
            batchPayload = self._mergePayloads([r.payload for r in requests])
            results = await self.inferenceFn(model, batchPayload)
            if len(results) != batchSize:
                raise RuntimeError(f"expected {batchSize} results, got {len(results)}")
            for i, request in enumerate(requests):
                if not request.future.done():
                    request.future.set_result(results[i])
        except Exception as e:
            # Fail every caller in the batch rather than leaving futures pending.
            for request in requests:
                if not request.future.done():
                    request.future.set_exception(e)

    def _mergePayloads(self, payloads: List[Dict]) -> Dict:
        # One message list per original request; plain prompts become a single user message.
        messages = []
        for payload in payloads:
            if "messages" in payload:
                messages.append(payload["messages"])
            elif "prompt" in payload:
                messages.append([{"role": "user", "content": payload["prompt"]}])
        return {"model": payloads[0].get("model", "default"), "messages": messages, "stream": False, "batch_size": len(payloads)}
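Pattern 7: Observability and distributed tracing
End-to-end tracing architecture
An AI inference path usually crosses several components: API gateway → sidecar proxy → model service → GPU scheduler. OpenTelemetry provides end-to-end tracing across them, which helps locate performance bottlenecks.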
┌─────────────────────────────────────────────────────────────┐
│ Observability Stack │
│ │
│ Request ──▶ [Gateway] ──▶ [Sidecar] ──▶ [Model Server] │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry Collector │ │
│ │ │ │
│ │ Traces ──▶ Jaeger/Tempo │ │
│ │ Metrics ──▶ Prometheus │ │
│ │ Logs ──▶ Loki/Elasticsearch │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ - inference_latency_ms (P50/P95/P99) │
│ - model_tokens_per_second │
│ - gpu_utilization_percent │
│ - batch_size_avg │
│ - request_queue_depth │
│ - model_load_time_seconds │
└─────────────────────────────────────────────────────────────┘
OpenTelemetry sidecar configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-obs
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference-obs
  template:
    metadata:
      labels:
        app: ai-inference-obs
      annotations:
        sidecar.opentelemetry.io/inject: "true"
        instrumentation.opentelemetry.io/inject-go: "true"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "ai-inference-app"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
        - name: otel-sidecar
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317
            - containerPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol-contrib/config.yaml
              subPath: config.yaml
      volumes:
        - name: otel-config
          configMap:
            name: otel-sidecar-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-sidecar-config
  namespace: ai-serving
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      transform:
        error_mode: ignore
        trace_statements:
          - context: span
            statements:
              - set(attributes["ai.model.name"], attributes["model"]) where attributes["model"] != nil
              - set(attributes["ai.inference.latency_ms"], attributes["duration"]/1000000) where attributes["duration"] != nil
    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Custom inference metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-metrics-rules
  namespace: ai-serving
data:
  rules.yaml: |
    groups:
      - name: ai_inference_metrics
        interval: 15s
        rules:
          - record: ai:inference:latency_p99
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="ai-inference"}[5m])) by (le, model))
          - record: ai:inference:tokens_per_second
            expr: sum(rate(ai_inference_tokens_total{job="ai-inference"}[5m])) by (model)
          # No gpu="*" matcher: "=" is an exact match, so it would only select a GPU literally named "*"
          - record: ai:inference:gpu_utilization
            expr: avg(DCGM_FI_DEV_GPU_UTIL) by (instance, gpu)
          - alert: InferenceHighLatency
            expr: ai:inference:latency_p99 > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "AI inference P99 latency above 5 seconds"
          - alert: GPUUtilizationLow
            expr: ai:inference:gpu_utilization < 30
            for: 10m
            labels:
              severity: info
            annotations:
              summary: "GPU utilization below 30%"
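The `ai:inference:latency_p99` rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw samples. A simplified sketch of that interpolation is shown below, assuming sorted cumulative `(upper_bound, count)` buckets ending in `+Inf` and positive bounds; it leaves out the edge cases that Prometheus handles (NaN inputs, negative bounds, native histograms):

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == prev_count:
        return upper
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.99, latency_buckets))
```

Because the result is interpolated inside one bucket, the P99 is only as precise as the bucket boundaries around it, which matters when choosing buckets for multi-second LLM latencies.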
5 common pitfalls and fixes
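Pitfall 1: Sidecar startup order causes lost requests
Symptom: the business container starts before the sidecar proxy, so the first inference requests go to localhost before the sidecar is listening and fail.
Fix: list the sidecar first and give it a postStart hook that blocks until it is healthy; in current kubelet behavior containers are started in order, and the next container is not started until the previous postStart hook has finished. As a second guard, point the business container's readinessProbe at the sidecar's health endpoint so the Pod receives no traffic until the sidecar is up.
spec:
  containers:
    # Listed first so its postStart hook gates the start of business-app
    - name: sidecar-proxy
      lifecycle:
        postStart:
          exec:
            # -f makes curl fail on HTTP errors, so the loop waits for a healthy response
            command: ["/bin/sh", "-c", "until curl -sf http://localhost:15001/health; do sleep 1; done"]
    - name: business-app
      readinessProbe:
        httpGet:
          path: /health
          port: 15001
        initialDelaySeconds: 5
        periodSeconds: 5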
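The postStart hook above polls the health endpoint in a shell loop with no upper bound. If the business application itself needs to wait for the sidecar, the same polling can be done in code with a timeout. A minimal sketch, where `wait_until_ready` and `http_ok` are illustrative helpers rather than an existing API, and the health URL is the one assumed in the YAML:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 1.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll probe() until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Assumed sidecar health endpoint, matching the YAML above.
    ready = wait_until_ready(lambda: http_ok("http://localhost:15001/health"), timeout_s=30)
    print("sidecar ready" if ready else "sidecar not ready, giving up")
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays; in production the defaults are used.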
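Pitfall 2: iptables rules conflict with GPU driver communication
Symptom: the sidecar's iptables interception rules disrupt NVIDIA GPU driver communication, and model loading fails.
Fix: exclude the ports and IP ranges used for GPU communication so that iptables does not intercept that traffic.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.96.0.0/12"
    traffic.sidecar.istio.io/excludeOutboundPorts: "50051,50052"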
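Pitfall 3: Large-model request bodies exceed Envoy's buffer
Symptom: the prompt in an LLM inference request can be very long (tens of KB). When it exceeds the buffer size configured for Envoy, the request is truncated or rejected with a 413 error.
Fix: raise Envoy's request buffer limit, or switch to streaming. The DestinationRule below only tunes HTTP connection handling (HTTP/2 upgrade and requests per connection); the buffer limit is not a DestinationRule field and has to be raised separately in the proxy configuration.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-dest
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100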
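Pitfall 4: Sidecar resource contention causes inference latency jitter
Symptom: the sidecar proxy shares a Pod with the model service, and contention for CPU and memory produces latency spikes.
Fix: give the sidecar its own resource requests and limits. To pin CPUs with the kubelet CPU Manager static policy, the container also needs integer CPU requests equal to its limits in a Guaranteed-QoS Pod; fractional values such as the ones below stay in the shared CPU pool.
spec:
  runtimeClassName: nvidia
  containers:
    - name: sidecar-proxy
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
  # Pod overhead (cpu 200m, memory 256Mi in the original sketch) is declared on the
  # RuntimeClass as overhead.podFixed and copied into the Pod at admission time,
  # so it is not set by hand in the Pod spec.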
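Pitfall 5: Sidecar connection pool exhausted during model hot reload
Symptom: during a model version switch, old connections are not closed and new ones cannot be established, so the connection pool runs out.
Fix: configure sensible connection pool timeouts and idle connection reclamation.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-conn-pool
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10s
        idleTimeout: 60s
      http:
        maxRequestsPerConnection: 50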
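Troubleshooting 10 common errors
| # | Error message | Cause | Fix |
|---|---|---|---|
| 1 | Sidecar proxy not ready | iptables rules were not injected or the sidecar container did not start | Check the namespace label istio-injection=enabled and confirm the sidecar image was pulled |
| 2 | upstream connect error or disconnect/reset before headers | The model service is not ready or the port does not match | Check the model container's health and confirm the port matches the VirtualService |
| 3 | GPU out of memory | Loading the model exceeds GPU memory | Lower gpu_memory_utilization, use a quantized model, or add GPUs |
| 4 | connection refused to 127.0.0.1:15001 | The sidecar is not listening on the expected port | Check the sidecar configuration and confirm the listener port is set correctly |
| 5 | request body too large | The request body exceeds Envoy's buffer limit | Raise the proxy's request size limit or enable streaming |
| 6 | model not found in registry | The model name does not match the routing configuration | Check the registered model name against the model field in the routing rules |
| 7 | circuit breaker open | Consecutive failures in the backend model service tripped the breaker | Check the model service's health and tune the breaker thresholds |
| 8 | timeout waiting for batch completion | The batch wait timed out | Increase maxWaitTimeMs or reduce maxBatchSize |
| 9 | CUDA error: no kernel image is available | The container's CUDA binaries were not built for this GPU's compute capability | Use an image built for the GPU architecture, and check that the NVIDIA driver and container CUDA versions match |
| 10 | OOMKilled for sidecar container | The sidecar ran out of memory and was killed by K8s | Raise the sidecar's memory limit and check for memory leaks |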
Advanced optimization techniques
1. Adaptive batching window
Adjust the batching window dynamically based on real-time load:
class AdaptiveBatcher:
    def __init__(self, minWaitMs=10, maxWaitMs=100, targetBatchSize=16):
        self.minWaitMs = minWaitMs
        self.maxWaitMs = maxWaitMs
        self.targetBatchSize = targetBatchSize
        self.currentWaitMs = minWaitMs
        self._emaArrivalRate = 0.0  # requests per second, smoothed

    def updateWaitTime(self, queueSize: int, intervalMs: float):
        if intervalMs > 0:
            arrivalRate = queueSize / (intervalMs / 1000.0)
            # Exponential moving average to damp bursts.
            self._emaArrivalRate = 0.7 * self._emaArrivalRate + 0.3 * arrivalRate
        if self._emaArrivalRate > 0:
            # Time needed to collect targetBatchSize requests at the current rate.
            optimalWait = (self.targetBatchSize / self._emaArrivalRate) * 1000
            self.currentWaitMs = max(self.minWaitMs, min(self.maxWaitMs, optimalWait))
        else:
            self.currentWaitMs = self.maxWaitMs
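The window formula is easier to judge with numbers plugged in. The sketch below isolates the same arithmetic in a standalone function (the name `optimal_wait_ms` is illustrative) and evaluates it at three arrival rates, using the default bounds of 10 ms and 100 ms:

```python
def optimal_wait_ms(target_batch_size: int, arrival_rate_rps: float,
                    min_wait_ms: float = 10.0, max_wait_ms: float = 100.0) -> float:
    """Time needed to collect target_batch_size requests, clamped to the bounds."""
    if arrival_rate_rps <= 0:
        return max_wait_ms
    ideal = target_batch_size / arrival_rate_rps * 1000.0
    return max(min_wait_ms, min(max_wait_ms, ideal))


if __name__ == "__main__":
    for rate in (50, 400, 5000):
        print(rate, "req/s ->", optimal_wait_ms(16, rate), "ms")
```

At 400 req/s a batch of 16 fills in 40 ms, so the window sits inside the bounds. At 50 req/s the ideal window would be 320 ms and is capped at 100 ms, which means batches stay small; at 5000 req/s the ideal 3.2 ms is raised to the 10 ms floor.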
2. Model warm-up and cold-start optimization
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-warmup-config
  namespace: ai-serving
data:
  warmup.yaml: |
    models:
      - name: qwen-chat-v2.5
        warmupRequests:
          - prompt: "Hello, how are you?"
            maxTokens: 32
          - prompt: "Explain quantum computing in one sentence."
            maxTokens: 64
        warmupInterval: 300s
        maxWarmupRetries: 3
3. Inference result caching
Cache inference results for identical prompts to avoid repeated computation:
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-cache-config
  namespace: ai-serving
data:
  cache.yaml: |
    enabled: true
    backend: redis
    redis:
      endpoint: redis://redis-cluster:6379
    ttl: 3600
    maxMemory: 2gb
    keyStrategy: prompt_hash
    cacheableModels:
      - qwen-chat-v2.5
      - bge-embedding-v1.5
    hitRateThreshold: 0.3
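One possible reading of `keyStrategy: prompt_hash` is a hash over the model name and the canonical form of the request, so that two requests with the same content share a key regardless of field order. The sketch below makes that assumption and uses an in-memory dictionary with a TTL as a stand-in for the Redis backend; the key prefix and class name are illustrative:

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def prompt_cache_key(model: str, payload: Dict[str, Any]) -> str:
    """Hash the model name plus the canonical JSON form of the request."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(f"{model}\n{canonical}".encode("utf-8")).hexdigest()
    return f"infer:{model}:{digest}"


class TtlCache:
    """In-memory stand-in for the Redis backend named in the config."""

    def __init__(self, ttl_s: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl_s, value)
```

Sampling parameters such as temperature are part of the payload and therefore part of the key. Caching is mainly meaningful for deterministic settings and for embeddings, since a cached answer for a sampled generation simply replays one earlier sample.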
4. Request priority and preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 1000000
globalDefault: false
description: "Real-time inference requests with a strict latency SLA"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 10000
globalDefault: false
description: "Batch inference with no latency SLA"
preemptionPolicy: Never
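Comparison: sidecar proxy vs Service Mesh vs Gateway
| Dimension | Sidecar proxy | Service Mesh (Istio) | API Gateway |
|---|---|---|---|
| Deployment location | Inside the Pod, next to the business container | Inside the Pod, mesh-wide coverage | Cluster edge, deployed independently |
| Traffic interception | iptables/eBPF | iptables/ztunnel | DNS/virtual IP |
| Model routing | Custom logic, flexible | VirtualService, declarative | Route rules, limited |
| Batching | Native support, deeply customizable | Not supported | Not supported |
| GPU awareness | Can be aware of GPU resources | Not aware | Not aware |
| Performance overhead | Low (5-15ms) | Medium (10-30ms) | Low (2-5ms) |
| Observability | Custom metrics | Mesh-wide metrics | Ingress metrics |
| Operational complexity | Medium | High | Low |
| Best suited for | Dedicated AI inference proxy | Cluster-wide service-to-service traffic | External traffic entry point |
| Learning curve | Medium | High | Low |
Recommended strategy: for AI inference, use a dedicated sidecar proxy for model routing and batching, the Service Mesh for service-to-service communication, and the API Gateway for external ingress. Each of the three layers has its own job and does not interfere with the others.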
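Online tools
- YAML/JSON formatter: /zh-TW/json/format (format K8s YAML configuration)
- Base64 encoder/decoder: /zh-TW/encode/base64 (handle credentials and keys in Secrets)
- curl to code: /zh-TW/dev/curl-to-code (quickly generate API test code)
Related reading
- K8s Gateway API service mesh traffic management migration guide: a closer look at Gateway API in a service mesh
- Python AI model production deployment in practice: a complete path for Python AI models from development to production
- K8s HPA autoscaling in production: autoscaling strategies for AI inference services
Summary
By 2026 the K8s sidecar AI inference proxy has become a standard architectural pattern for AI inference deployment. The 7 production patterns cover the full path from traffic interception to observability: Envoy traffic interception keeps business code unchanged, smart model routing picks a model dynamically from request features, A/B testing and canary releases make model rollouts safer, multi-model serving lets versions coexist, GPU resource pooling can raise utilization from 30% to 80%+, request batching can raise throughput 5-8x, and OpenTelemetry end-to-end tracing exposes performance bottlenecks. The core principle: the sidecar proxy focuses on inference logic, the business container focuses on business logic, and the two talk over localhost with no coupling and no intrusion into business code.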