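K8s Sidecar AI Inference Proxy: 7 Production Patterns from Traffic Interception to Model Routing
Is your AI inference service still running with nothing in front of it? Do you have to change business code for every model upgrade? Is traffic management a mess when several model versions coexist? Is GPU utilization under 30% while you keep adding capacity? In 2026, the Kubernetes sidecar pattern has become a standard architecture for AI inference proxies, and it is time to use sidecar containers to decouple inference logic from business logic.
Key takeaways
- Understand the architecture and use cases of a K8s sidecar AI inference proxy
- Learn the complete implementation of 7 production-grade sidecar proxy patterns
- Write the YAML for traffic interception, model routing, and A/B testing
- Get to know advanced optimizations such as GPU resource pooling and request batching
- Avoid 5 common pitfalls and 10 frequent errors
Contents
- Sidecar AI proxy architecture overview
- Pattern 1: Envoy traffic interception and rewriting
- Pattern 2: Smart model routing
- Pattern 3: A/B testing and canary releases
- Pattern 4: Multi-model serving and version management
- Pattern 5: GPU resource pooling
- Pattern 6: Batching and request merging
- Pattern 7: Observability and distributed tracing
- 5 common pitfalls and their fixes
- Troubleshooting 10 common errors
- Advanced optimization techniques
- Comparison: sidecar proxy vs Service Mesh vs Gateway
- Recommended online tools
Sidecar AI proxy architecture overview
Why does AI inference need a sidecar proxy?
Traditional AI inference deployments couple model loading, inference execution, and traffic management inside the business container. Whenever a model version, routing policy, or resource limit changes, the whole service has to be rebuilt and redeployed. The sidecar proxy pattern separates these concerns: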
┌───────────────────────────────────────────────────┐
│                        Pod                        │
│                                                   │
│  ┌──────────────┐      ┌──────────────────────┐   │
│  │ Business     │      │ Sidecar Proxy        │   │
│  │ Container    │─────▶│ (AI Inference)       │   │
│  │              │      │                      │   │
│  │ - API logic  │      │ - Traffic intercept  │   │
│  │ - Business   │      │ - Model routing      │   │
│  │   processing │      │ - Load balancing     │   │
│  │ - Result     │      │ - Request batching   │   │
│  │   aggregation│      │ - Metrics collection │   │
│  │ :8080        │      │                      │   │
│  └──────┬───────┘      │ :15001 (outbound)    │   │
│         │              │ :15006 (inbound)     │   │
│         ▼              └──────────┬───────────┘   │
│  ┌──────────────┐                 │               │
│  │ Model Server │◀────────────────┘               │
│  │ (vLLM/Triton │                                 │
│  │  /Ollama)    │                                 │
│  │ :8000        │                                 │
│  └──────────────┘                                 │
└───────────────────────────────────────────────────┘
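Core responsibilities of the sidecar proxy
| Responsibility | Description | Benefit |
|---|---|---|
| Traffic interception | Intercepts inference requests from the business container | No changes to business code |
| Model routing | Routes to different models based on request features | Multiple model versions coexist |
| Load balancing | Distributes inference requests across replicas | Higher throughput |
| Batching | Merges several requests into one batched inference call | GPU utilization up 3-5x |
| Circuit breaking and fallback | Degrades quickly when the model service misbehaves | Protects system stability |
| Observability | Collects inference latency, throughput, and related metrics | End-to-end visibility |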
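Pattern 1: Envoy traffic interception and rewriting
How it works
Envoy, running as the sidecar proxy, relies on iptables rules to intercept the business container's outbound traffic and rewrites inference requests to the target model service. This is the classic traffic-interception approach for K8s sidecars.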
Client Request
│
▼
┌─────────┐ iptables ┌──────────────┐
│ Business │───redirect───▶│ Envoy │
│ Container│ :8080 │ Sidecar │
│ │ │ :15001 │
└─────────┘ └──────┬───────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Model A │ │ Model B │ │ Model C │
│ :8000 │ │ :8001 │ │ :8002 │
└─────────┘ └─────────┘ └─────────┘
Full configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.0/8"
        traffic.sidecar.istio.io/excludeInboundPorts: "9090"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: INFERENCE_ENDPOINT
              value: "http://localhost:15001/v1/completions"
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
        - name: model-server
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
          # Note: the stock vLLM image is configured through command-line
          # flags (--model, --gpu-memory-utilization). These env vars only
          # take effect if your entrypoint maps them to those flags.
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-72B-Instruct"
            - name: GPU_MEMORY_UTILIZATION
              value: "0.9"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
Envoy rewrite rules
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-rewrite
  namespace: ai-serving
spec:
  hosts:
    - inference-internal
  http:
    - match:
        # uri and headers sit in ONE match entry, so both must match (AND).
        # Written as two separate list items they would be ORed.
        - uri:
            prefix: "/v1/completions"
          headers:
            x-model-type:
              exact: "chat"
      rewrite:
        uri: "/v1/chat/completions"
      route:
        - destination:
            host: model-server
            port:
              number: 8000
    - match:
        - uri:
            prefix: "/v1/embeddings"
      route:
        - destination:
            host: embedding-server
            port:
              number: 8001
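Pattern 2: Smart model routing
Dynamic routing based on request features
In AI inference, different requests may need to go to models of different sizes. The sidecar proxy can route on request headers, payload content, estimated token count, and similar features.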
┌──────────────────────────────────────────────────────┐
│ Smart Model Router │
│ │
│ Request ──▶ [Token Counter] ──▶ [Model Selector] │
│ │ │ │
│ │ ┌──────────────┼──────────┐ │
│ │ ▼ ▼ ▼ │
│ │ ┌───────┐ ┌─────────┐ ┌──────┐ │
│ │ │Small │ │Medium │ │Large │ │
│ │ │Model │ │Model │ │Model │ │
│ │ │<1K tok│ │1K-8K tok│ │>8K │ │
│ │ │Qwen2.5│ │Qwen2.5 │ │Qwen2.│ │
│ │ │-7B │ │-32B │ │5-72B │ │
│ │ └───────┘ └─────────┘ └──────┘ │
└──────────────────────────────────────────────────────┘
Routing configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: smart-model-route
  namespace: ai-serving
spec:
  hosts:
    - ai-router
  http:
    - match:
        - headers:
            x-token-range:
              exact: "small"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-7b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "medium"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "large"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-72b-service
            port:
              number: 8000
          weight: 100
    # Default route when no x-token-range header is present
    - route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
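The VirtualService above only matches on the `x-token-range` header, so something upstream has to set it. The following is a minimal sketch of that tagging step, assuming a rough four-characters-per-token estimate and the 1K/8K thresholds from the diagram; the function names are hypothetical and not part of any library.

```python
from typing import Dict, List, Optional

# Assumed thresholds, taken from the routing diagram:
# up to 1K tokens -> small, up to 8K -> medium, above that -> large.
SMALL_MAX_TOKENS = 1000
MEDIUM_MAX_TOKENS = 8000


def estimate_tokens(prompt: str = "",
                    messages: Optional[List[Dict[str, str]]] = None) -> int:
    """Very rough estimate: about four characters per token."""
    total_chars = len(prompt or "")
    for message in messages or []:
        total_chars += len(message.get("content", ""))
    return total_chars // 4


def classify_token_range(token_count: int) -> str:
    """Map a token estimate onto the header values the routes match on."""
    if token_count <= SMALL_MAX_TOKENS:
        return "small"
    if token_count <= MEDIUM_MAX_TOKENS:
        return "medium"
    return "large"


def with_token_range_header(headers: Dict[str, str], payload: dict) -> Dict[str, str]:
    """Return a copy of the headers with x-token-range added."""
    tokens = estimate_tokens(payload.get("prompt", ""), payload.get("messages"))
    tagged = dict(headers)
    tagged["x-token-range"] = classify_token_range(tokens)
    return tagged
```

A real tokenizer would give better estimates, especially for non-English text, but a cheap heuristic is often enough to pick a size class.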
Go routing proxy implementation
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.uber.org/zap"
)

// ModelRouteConfig holds the three backends and the token thresholds between them.
type ModelRouteConfig struct {
	SmallModelEndpoint  string `json:"smallModelEndpoint"`
	MediumModelEndpoint string `json:"mediumModelEndpoint"`
	LargeModelEndpoint  string `json:"largeModelEndpoint"`
	SmallTokenThreshold int    `json:"smallTokenThreshold"`
	LargeTokenThreshold int    `json:"largeTokenThreshold"`
}

// InferenceRequest is the subset of an OpenAI-style request body the router inspects.
type InferenceRequest struct {
	Model     string    `json:"model"`
	Messages  []Message `json:"messages,omitempty"`
	Prompt    string    `json:"prompt,omitempty"`
	MaxTokens int       `json:"max_tokens,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type SmartRouter struct {
	config  *ModelRouteConfig
	logger  *zap.Logger
	metrics *RouterMetrics
}

type RouterMetrics struct {
	routeDecision  *prometheus.CounterVec
	requestLatency *prometheus.HistogramVec
}

func NewSmartRouter(cfg *ModelRouteConfig, logger *zap.Logger) *SmartRouter {
	metrics := &RouterMetrics{
		routeDecision: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "ai_router_route_decision_total",
				Help: "Model route decision counter",
			},
			[]string{"model_size", "model_name"},
		),
		requestLatency: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "ai_router_request_latency_seconds",
				Help:    "Request routing latency",
				Buckets: prometheus.DefBuckets,
			},
			[]string{"model_size"},
		),
	}
	prometheus.MustRegister(metrics.routeDecision, metrics.requestLatency)
	return &SmartRouter{
		config:  cfg,
		logger:  logger,
		metrics: metrics,
	}
}

// estimateTokenCount uses a rough 4-bytes-per-token heuristic. len() counts
// bytes, so multi-byte text such as Chinese is overestimated.
func (r *SmartRouter) estimateTokenCount(req *InferenceRequest) int {
	totalChars := 0
	if req.Prompt != "" {
		totalChars += len(req.Prompt)
	}
	for _, msg := range req.Messages {
		totalChars += len(msg.Content)
	}
	return totalChars / 4
}

// selectModel returns the size label and endpoint for a token estimate.
func (r *SmartRouter) selectModel(tokenCount int) (string, string) {
	if tokenCount <= r.config.SmallTokenThreshold {
		return "small", r.config.SmallModelEndpoint
	}
	if tokenCount <= r.config.LargeTokenThreshold {
		return "medium", r.config.MediumModelEndpoint
	}
	return "large", r.config.LargeModelEndpoint
}

func (r *SmartRouter) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	defer req.Body.Close()
	body, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusBadRequest)
		return
	}
	var inferReq InferenceRequest
	if err := json.Unmarshal(body, &inferReq); err != nil {
		http.Error(w, "invalid request format", http.StatusBadRequest)
		return
	}
	tokenCount := r.estimateTokenCount(&inferReq)
	modelSize, endpoint := r.selectModel(tokenCount)
	r.logger.Info("routing decision",
		zap.Int("token_count", tokenCount),
		zap.String("model_size", modelSize),
		zap.String("endpoint", endpoint),
	)
	r.metrics.routeDecision.WithLabelValues(modelSize, inferReq.Model).Inc()
	target, err := url.Parse(endpoint)
	if err != nil {
		http.Error(w, "invalid model endpoint", http.StatusInternalServerError)
		return
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	// The body was consumed above, so restore it before forwarding.
	req.Body = io.NopCloser(bytes.NewReader(body))
	req.ContentLength = int64(len(body))
	proxy.ServeHTTP(w, req)
	// Observed after proxying, so this covers the upstream round trip too.
	r.metrics.requestLatency.WithLabelValues(modelSize).Observe(time.Since(start).Seconds())
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	cfg := &ModelRouteConfig{
		SmallModelEndpoint:  "http://qwen-7b-service:8000",
		MediumModelEndpoint: "http://qwen-32b-service:8000",
		LargeModelEndpoint:  "http://qwen-72b-service:8000",
		SmallTokenThreshold: 1000,
		LargeTokenThreshold: 8000,
	}
	router := NewSmartRouter(cfg, logger)
	mux := http.NewServeMux()
	mux.Handle("/v1/", router)
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	})
	server := &http.Server{
		Addr:         ":15001",
		Handler:      mux,
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 120 * time.Second,
	}
	logger.Info("smart router starting", zap.String("addr", server.Addr))
	if err := server.ListenAndServe(); err != nil {
		logger.Fatal("server failed", zap.Error(err))
	}
}
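Pattern 3: A/B testing and canary releases
Weight-based canary release
A canary phase is essential when a new AI model goes live. The sidecar proxy can split traffic by weight and shift it gradually from the old model to the new one.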
┌────────────────────────────────────────────┐
│ Canary Deployment Flow │
│ │
│ Traffic ──▶ [Sidecar Proxy] ──┬── 90% ──▶│──▶ Model v1 (Stable)
│ │ │
│ └── 10% ──▶│──▶ Model v2 (Canary)
│ │
│ Metrics: │
│ ┌──────────────────────────────────────┐ │
│ │ v1: latency_p99=120ms error=0.1% │ │
│ │ v2: latency_p99=95ms error=0.05% │ │
│ └──────────────────────────────────────┘ │
│ │
│ Decision: Promote v2 ──▶ Shift to 50/50 │
└────────────────────────────────────────────┘
Canary release configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-canary
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    # Named so that tooling such as Argo Rollouts can reference this route
    - name: primary
      route:
        - destination:
            host: model-v1
            port:
              number: 8000
          weight: 90
        - destination:
            host: model-v2
            port:
              number: 8000
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 30s
        retryOn: 5xx,reset
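To get a feel for what a 90/10 weighted split does, here is a minimal sketch of weighted selection in a custom proxy. It is an illustration only and not how Istio or Envoy pick a destination: it hashes a request key (a hypothetical user ID here) so the same caller always lands on the same version, which is one way to keep canary assignment sticky.

```python
import hashlib
from typing import List, Tuple


def pick_backend(request_key: str, weighted_backends: List[Tuple[str, int]]) -> str:
    """Deterministically map a key onto one of the weighted backends."""
    total = sum(weight for _, weight in weighted_backends)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the key into a bucket in [0, total).
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the cumulative weights until the bucket falls inside a range.
    cumulative = 0
    for name, weight in weighted_backends:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    backends = [("model-v1", 90), ("model-v2", 10)]
    picks = [pick_backend(f"user-{i}", backends) for i in range(10_000)]
    print("canary share:", picks.count("model-v2") / len(picks))
```

Over many distinct keys the canary share should settle close to the configured 10%.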
Header-based A/B testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-ab-test
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-creative"
      route:
        - destination:
            host: model-v2-creative
            port:
              number: 8000
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-precise"
      route:
        - destination:
            host: model-v2-precise
            port:
              number: 8000
    # Everything without an experiment header stays on v1
    - route:
        - destination:
            host: model-v1
            port:
              number: 8000
Automated canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-rollout
  namespace: ai-serving
spec:
  replicas: 4
  # selector and pod template (or workloadRef) are omitted here for brevity;
  # a real Rollout needs them.
  strategy:
    canary:
      canaryService: model-v2
      stableService: model-v1
      trafficRouting:
        istio:
          virtualServices:
            - name: model-canary
              routes:
                - primary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 75
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: model-quality-check
        startingStep: 2
        args:
          - name: canary-service
            value: model-v2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
  namespace: ai-serving
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          query: |
            # "5.." is a regex for any 5xx status; the literal "5xx" would match nothing
            sum(rate(http_requests_total{service="{{args.canary-service}}",code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.canary-service}}"}[1m]))
    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: result[0] <= 500
      provider:
        prometheus:
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.canary-service}}"}[1m]))
              by (le)
            ) * 1000
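The latency-p99 check above depends on how `histogram_quantile` turns cumulative buckets into a single number. The sketch below is a simplified re-implementation for classic histograms with non-negative bounds (linear interpolation inside the bucket that contains the target rank); it is meant to show the arithmetic, not to replace Prometheus. The bucket values in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in ordered:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return upper
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds.
    example = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100), (float("inf"), 100)]
    p99_ms = histogram_quantile(0.99, example) * 1000
    print(f"p99 is about {p99_ms:.0f} ms; the canary passes if this stays <= 500")
```

Because the result is interpolated within a bucket, the bucket boundaries bound the precision of the check: a threshold of 500 ms is only meaningful if a bucket edge sits near 0.5 s.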
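Pattern 4: Multi-model serving and version management
Architecture for coexisting model versions
In production, different business lines may depend on different model versions. The sidecar proxy can manage several versions at once, so they coexist and traffic can migrate smoothly between them.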
┌─────────────────────────────────────────────────────┐
│ Multi-Model Serving Pod │
│ │
│ ┌──────────┐ ┌────────────────────────────┐ │
│ │ Business │ │ Model Router Sidecar │ │
│ │ App │────▶│ │ │
│ │ │ │ /v1/chat ──▶ v2.5 model │ │
│ │ │ │ /v1/embed ──▶ embed model │ │
│ │ │ │ /v1/rerank─▶ rerank model │ │
│ └──────────┘ └────────────┬───────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌────────┐│
│ │ vLLM │ │ TEI │ │ TEI ││
│ │ Chat │ │ Embed │ │ Rerank ││
│ │ :8000 │ │ :8080 │ │ :8081 ││
│ └──────────┘ └──────────┘ └────────┘│
└─────────────────────────────────────────────────────┘
Multi-model Deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-serving
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-model
  template:
    metadata:
      labels:
        app: multi-model
    spec:
      containers:
        - name: chat-model
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-32B-Instruct"
            - name: PORT
              value: "8000"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
        - name: embedding-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-large-zh-v1.5"
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
        - name: rerank-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8081
          env:
            - name: MODEL_ID
              value: "BAAI/bge-reranker-v2-m3"
            - name: PORT
              value: "8081"
            - name: RERANK
              value: "true"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Model version management CRD
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-5
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.5"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.5-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 80
    canary: false
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
---
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-6-canary
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.6"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.6-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 20
    canary: true
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
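Pattern 5: GPU resource pooling
GPU sharing and time-slicing
GPUs are the most expensive resource in AI inference. The sidecar proxy can pool GPU resources so that several inference services share one GPU, using time-slicing to raise utilization.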
┌─────────────────────────────────────────────────────┐
│ GPU Resource Pooling │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ Sidecar │ │ Sidecar │ │ Sidecar │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ GPU Scheduler Sidecar │ │
│ │ │ │
│ │ Time-Slicing: │ │
│ │ GPU 0: [A][A][B][B][C][A][A][B][B][C] │ │
│ │ GPU 1: [C][C][A][B][C][C][A][B][C][C] │ │
│ │ │ │
│ │ Memory Partitioning: │ │
│ │ GPU 0: 40% A | 35% B | 25% C │ │
│ │ GPU 1: 30% A | 40% B | 30% C │ │
│ └──────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │
│ │ A100 80G │ │ A100 80G │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────┘
GPU time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: ai-serving
data:
  scheduler.yaml: |
    scheduling:
      strategy: time-slicing
      gpuGroups:
        - name: inference-pool
          gpuIds: [0, 1, 2, 3]
          timeSliceInterval: 100ms
          maxSharesPerGpu: 4
          memoryLimitPerShare: 20Gi
        - name: embedding-pool
          gpuIds: [4, 5]
          timeSliceInterval: 50ms
          maxSharesPerGpu: 8
          memoryLimitPerShare: 10Gi
    policies:
      - modelType: chat
        gpuGroup: inference-pool
        minShares: 1
        maxShares: 2
        priority: high
      - modelType: embedding
        gpuGroup: embedding-pool
        minShares: 1
        maxShares: 4
        priority: medium
GPU resource quota management
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    # For extended resources Kubernetes only accepts quota items with the
    # "requests." prefix, so the "limits." entries are dropped.
    requests.nvidia.com/gpu: "8"
    requests.nvidia.com/gpu-share: "32"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100000
globalDefault: false
description: "Low priority for batch inference jobs"
Python GPU scheduler
import asyncio
import time
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

logger = logging.getLogger(__name__)


class TaskPriority(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class InferenceTask:
    taskId: str
    modelId: str
    gpuMemoryRequired: int  # MB
    priority: TaskPriority
    maxLatencyMs: int
    submittedAt: float = field(default_factory=time.time)
    assignedGpu: Optional[int] = None


@dataclass
class GpuSlot:
    gpuId: int
    totalMemory: int  # MB
    usedMemory: int = 0
    currentModel: Optional[str] = None
    lastUsed: float = field(default_factory=time.time)

    @property
    def availableMemory(self) -> int:
        return self.totalMemory - self.usedMemory

    @property
    def utilization(self) -> float:
        return self.usedMemory / self.totalMemory if self.totalMemory > 0 else 0.0


class GpuScheduler:
    def __init__(self, gpuSlots: List[GpuSlot], maxQueueSize: int = 1000):
        self.gpuSlots = {slot.gpuId: slot for slot in gpuSlots}
        self.taskQueue: List[InferenceTask] = []
        self.maxQueueSize = maxQueueSize
        self._lock = asyncio.Lock()
        self._stats = {
            "totalScheduled": 0,
            "totalRejected": 0,
            "totalEvicted": 0,
        }

    async def submitTask(self, task: InferenceTask) -> Optional[int]:
        """Place the task on a GPU, or queue it. Returns the GPU id or None."""
        async with self._lock:
            if len(self.taskQueue) >= self.maxQueueSize:
                self._stats["totalRejected"] += 1
                logger.warning(f"Task {task.taskId} rejected: queue full")
                return None
            gpuId = self._findBestGpu(task)
            if gpuId is not None:
                slot = self.gpuSlots[gpuId]
                slot.usedMemory += task.gpuMemoryRequired
                slot.currentModel = task.modelId
                slot.lastUsed = time.time()
                task.assignedGpu = gpuId
                self._stats["totalScheduled"] += 1
                logger.info(
                    f"Task {task.taskId} scheduled on GPU {gpuId}, "
                    f"memory: {slot.usedMemory}/{slot.totalMemory}"
                )
                return gpuId
            # No GPU fits: queue the task, highest priority first.
            self.taskQueue.append(task)
            self.taskQueue.sort(key=lambda t: t.priority.value)
            logger.info(f"Task {task.taskId} queued, queue size: {len(self.taskQueue)}")
            return None

    def _findBestGpu(self, task: InferenceTask) -> Optional[int]:
        """Pick the least-utilized GPU with enough free memory."""
        candidates = []
        for gpuId, slot in self.gpuSlots.items():
            if slot.availableMemory >= task.gpuMemoryRequired:
                candidates.append((gpuId, slot))
        if not candidates:
            return self._tryEvict(task)
        candidates.sort(key=lambda x: x[1].utilization)
        return candidates[0][0]

    def _tryEvict(self, task: InferenceTask) -> Optional[int]:
        """For HIGH priority tasks, evict a model idle for more than 5 minutes."""
        if task.priority != TaskPriority.HIGH:
            return None
        for gpuId, slot in self.gpuSlots.items():
            idle = slot.currentModel and slot.lastUsed < time.time() - 300
            # Only evict if the task can fit on this GPU at all.
            if idle and slot.totalMemory >= task.gpuMemoryRequired:
                logger.info(
                    f"Evicting model {slot.currentModel} from GPU {gpuId} "
                    f"for high-priority task {task.taskId}"
                )
                slot.usedMemory = 0
                slot.currentModel = None
                self._stats["totalEvicted"] += 1
                return gpuId
        return None

    async def releaseGpu(self, gpuId: int, memoryFreed: int):
        """Free memory on a GPU and, if possible, start the next queued task."""
        async with self._lock:
            slot = self.gpuSlots.get(gpuId)
            if slot:
                slot.usedMemory = max(0, slot.usedMemory - memoryFreed)
                if slot.usedMemory == 0:
                    slot.currentModel = None
                logger.info(f"GPU {gpuId} released {memoryFreed}MB, available: {slot.availableMemory}MB")
                if self.taskQueue:
                    nextTask = self.taskQueue[0]
                    if slot.availableMemory >= nextTask.gpuMemoryRequired:
                        self.taskQueue.pop(0)
                        slot.usedMemory += nextTask.gpuMemoryRequired
                        slot.currentModel = nextTask.modelId
                        nextTask.assignedGpu = gpuId
                        self._stats["totalScheduled"] += 1

    def getStats(self) -> Dict:
        return {
            **self._stats,
            "queueSize": len(self.taskQueue),
            "gpuUtilization": {
                gpuId: {
                    "utilization": f"{slot.utilization:.1%}",
                    "availableMemory": f"{slot.availableMemory}MB",
                    "currentModel": slot.currentModel,
                }
                for gpuId, slot in self.gpuSlots.items()
            },
        }
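Pattern 6: Batching and request merging
Dynamic batching architecture
GPU utilization in LLM inference is often low (10-30%) when every request is processed on its own. The sidecar proxy can collect the requests that arrive within a short time window and merge them into one batched inference call, which raises throughput considerably.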
┌─────────────────────────────────────────────────────┐
│ Dynamic Batching Sidecar │
│ │
│ Request 1 ──▶ ┐ │
│ Request 2 ──▶ │ ┌──────────────────────────┐ │
│ Request 3 ──▶ ├─▶│ Batch Window (50ms) │ │
│ Request 4 ──▶ │ │ │ │
│ Request 5 ──▶ ┘ │ Collect → Merge → Send │ │
│ │ │ │
│ │ Batch Size: 4-32 │ │
│ │ Max Wait: 50ms │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Model Server │ │
│ │ (vLLM with │ │
│ │ continuous │ │
│ │ batching) │ │
│ └──────────────────────┘ │
│ │
│ Throughput: 1x → 5-8x │
│ Latency overhead: +5-15ms │
└─────────────────────────────────────────────────────┘
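How full a batch gets depends on the arrival rate and the window length. As a back-of-the-envelope sketch (assuming roughly steady, independent arrivals; the function names are hypothetical), the first request opens the window and every request arriving within it joins the batch, up to the size cap:

```python
def expected_batch_size(arrival_rate_per_s: float, window_ms: float,
                        max_batch_size: int) -> float:
    """Rough batch size: the opening request plus arrivals during the window."""
    arrivals_in_window = arrival_rate_per_s * window_ms / 1000.0
    return min(float(max_batch_size), 1.0 + arrivals_in_window)


def window_for_target_batch(target_batch_size: int, arrival_rate_per_s: float) -> float:
    """Window length in ms needed to reach a target batch size on average."""
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival rate must be positive")
    return (target_batch_size - 1) / arrival_rate_per_s * 1000.0


if __name__ == "__main__":
    # 200 req/s with the 50 ms window and a cap of 32 from the diagram.
    print(expected_batch_size(200, 50, 32))
    # At low traffic the window expires long before the batch fills.
    print(expected_batch_size(5, 50, 32))
```

At low request rates the window mostly adds latency without forming real batches, which is why the wait time is worth tuning per workload.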
Batching proxy configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: batch-proxy-config
  namespace: ai-serving
data:
  proxy.yaml: |
    server:
      port: 15001
      readTimeout: 30s
      writeTimeout: 120s
    batching:
      enabled: true
      maxBatchSize: 32
      maxWaitTimeMs: 50
      maxRequestTokens: 8192
      strategy: dynamic
    routing:
      defaultEndpoint: http://localhost:8000
      endpoints:
        - path: /v1/chat/completions
          model: chat
          batchEnabled: true
        - path: /v1/embeddings
          model: embedding
          batchEnabled: true
          maxBatchSize: 64
    circuitBreaker:
      enabled: true
      failureThreshold: 5
      recoveryTimeout: 30s
      halfOpenRequests: 3
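The `circuitBreaker` settings above describe a small state machine. The sketch below is one plausible reading of those three parameters, not the behavior of any particular proxy: the breaker opens after `failureThreshold` consecutive failures, lets up to `halfOpenRequests` trial requests through once `recoveryTimeout` has passed, and closes again only if all trials succeed. The class and its injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout_s: float = 30.0,
                 half_open_requests: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.half_open_requests = half_open_requests
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._trials_started = 0
        self._trials_succeeded = 0

    def allow_request(self) -> bool:
        """Return True if the caller may send a request to the backend."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout_s:
                return False
            # Recovery timeout elapsed: start probing the backend.
            self.state = "half_open"
            self._trials_started = 0
            self._trials_succeeded = 0
        if self.state == "half_open":
            if self._trials_started >= self.half_open_requests:
                return False
            self._trials_started += 1
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._trials_succeeded += 1
            if self._trials_succeeded >= self.half_open_requests:
                self.state = "closed"
                self._failures = 0
        elif self.state == "closed":
            self._failures = 0

    def record_failure(self) -> None:
        if self.state == "open":
            return
        if self.state == "half_open":
            self._trip()  # any failed trial reopens the breaker
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller checks `allow_request()` before forwarding, then reports the outcome with `record_success()` or `record_failure()`; when the breaker refuses, the proxy can return a fallback response immediately instead of waiting on a failing model server.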
Python batching proxy
import asyncio
import time
import uuid
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Any, Optional
from collections import defaultdict

logger = logging.getLogger(__name__)


@dataclass
class BatchRequest:
    requestId: str
    payload: Dict[str, Any]
    future: asyncio.Future
    submittedAt: float = field(default_factory=time.time)


@dataclass
class BatchConfig:
    maxBatchSize: int = 32
    maxWaitTimeMs: int = 50
    maxRequestTokens: int = 8192


class DynamicBatcher:
    def __init__(self, config: BatchConfig, inferenceFn):
        self.config = config
        self.inferenceFn = inferenceFn
        self.pendingRequests: Dict[str, List[BatchRequest]] = defaultdict(list)
        self._running = False
        self._loopTask: Optional[asyncio.Task] = None
        # Hold references so in-flight batch tasks are not garbage-collected.
        self._batchTasks: set = set()
        self._stats = {
            "totalBatches": 0,
            "totalRequests": 0,
            "avgBatchSize": 0.0,
            "avgWaitTimeMs": 0.0,
        }

    async def start(self):
        self._running = True
        self._loopTask = asyncio.create_task(self._batchLoop())

    async def stop(self):
        self._running = False

    async def submit(self, model: str, payload: Dict[str, Any]) -> Any:
        """Queue one request and wait for its result from the batch."""
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(
            requestId=str(uuid.uuid4()),
            payload=payload,
            future=future,
        )
        self.pendingRequests[model].append(request)
        self._stats["totalRequests"] += 1
        # Flush immediately once a full batch is waiting.
        if len(self.pendingRequests[model]) >= self.config.maxBatchSize:
            task = asyncio.create_task(self._processBatch(model))
            self._batchTasks.add(task)
            task.add_done_callback(self._batchTasks.discard)
        return await future

    async def _batchLoop(self):
        """Flush any model queue whose oldest request has waited long enough."""
        while self._running:
            for model in list(self.pendingRequests.keys()):
                if self.pendingRequests[model]:
                    oldest = self.pendingRequests[model][0]
                    waitTime = (time.time() - oldest.submittedAt) * 1000
                    if waitTime >= self.config.maxWaitTimeMs:
                        await self._processBatch(model)
            await asyncio.sleep(0.005)

    async def _processBatch(self, model: str):
        requests = self.pendingRequests[model][:self.config.maxBatchSize]
        self.pendingRequests[model] = self.pendingRequests[model][len(requests):]
        if not requests:
            return
        batchSize = len(requests)
        self._stats["totalBatches"] += 1
        waitTimes = [(time.time() - r.submittedAt) * 1000 for r in requests]
        avgWait = sum(waitTimes) / len(waitTimes)
        # Exponential moving averages of wait time and batch size
        self._stats["avgWaitTimeMs"] = (
            self._stats["avgWaitTimeMs"] * 0.95 + avgWait * 0.05
        )
        self._stats["avgBatchSize"] = (
            self._stats["avgBatchSize"] * 0.95 + batchSize * 0.05
        )
        logger.info(
            f"Processing batch for model={model}, "
            f"size={batchSize}, avgWait={avgWait:.1f}ms"
        )
        try:
            batchPayload = self._mergePayloads([r.payload for r in requests])
            results = await self.inferenceFn(model, batchPayload)
            for i, request in enumerate(requests):
                if not request.future.done():
                    request.future.set_result(results[i])
        except Exception as e:
            logger.error(f"Batch inference failed: {e}")
            for request in requests:
                if not request.future.done():
                    request.future.set_exception(e)

    def _mergePayloads(self, payloads: List[Dict]) -> Dict:
        # Produces a list of conversations. An OpenAI-compatible
        # /v1/chat/completions endpoint accepts one conversation per call,
        # so inferenceFn must understand this merged format itself.
        messages = []
        for payload in payloads:
            if "messages" in payload:
                messages.append(payload["messages"])
            elif "prompt" in payload:
                messages.append([{"role": "user", "content": payload["prompt"]}])
        return {
            "model": payloads[0].get("model", "default"),
            "messages": messages,
            "stream": False,
            "batch_size": len(payloads),
        }
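Pattern 7: Observability and distributed tracing
End-to-end tracing architecture
An AI inference path usually crosses several components: API gateway → sidecar proxy → model server → GPU scheduler. OpenTelemetry can trace the whole path and help locate performance bottlenecks.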
┌─────────────────────────────────────────────────────────────┐
│ Observability Stack │
│ │
│ Request ──▶ [Gateway] ──▶ [Sidecar] ──▶ [Model Server] │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry Collector │ │
│ │ │ │
│ │ Traces ──▶ Jaeger/Tempo │ │
│ │ Metrics ──▶ Prometheus │ │
│ │ Logs ──▶ Loki/Elasticsearch │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ - inference_latency_ms (P50/P95/P99) │
│ - model_tokens_per_second │
│ - gpu_utilization_percent │
│ - batch_size_avg │
│ - request_queue_depth │
│ - model_load_time_seconds │
└─────────────────────────────────────────────────────────────┘
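The P50/P95/P99 figures listed under key metrics are percentiles of observed latencies. When raw samples are available (for example in a load test), they can be computed directly. This is a minimal sketch using the nearest-rank method; the function names are hypothetical, and production dashboards would normally derive these values from histogram buckets instead.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be within (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latencies the way the metric list reports them."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

With only a handful of samples P95 and P99 collapse onto the slowest request, so tail percentiles are only informative over a reasonably large window.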
OpenTelemetry sidecar configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-obs
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference-obs
  template:
    metadata:
      labels:
        app: ai-inference-obs
      annotations:
        sidecar.opentelemetry.io/inject: "true"
        instrumentation.opentelemetry.io/inject-go: "true"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "ai-inference-app"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
        - name: otel-sidecar
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317
            - containerPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol-contrib/config.yaml
              subPath: config.yaml
      volumes:
        - name: otel-config
          configMap:
            name: otel-sidecar-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-sidecar-config
  namespace: ai-serving
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      # Drops spans that match the condition, i.e. successful requests
      filter:
        error_mode: ignore
        traces:
          span:
            - 'attributes["http.status_code"] == 200'
      transform:
        error_mode: ignore
        trace_statements:
          - context: span
            statements:
              - set(attributes["ai.model.name"], attributes["model"]) where attributes["model"] != nil
              - set(attributes["ai.inference.latency_ms"], attributes["duration"]/1000000) where attributes["duration"] != nil
    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, filter, transform, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Custom inference metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-metrics-rules
  namespace: ai-serving
data:
  rules.yaml: |
    groups:
      - name: ai_inference_metrics
        interval: 15s
        rules:
          - record: ai:inference:latency_p99
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="ai-inference"}[5m])) by (le, model))
          - record: ai:inference:tokens_per_second
            expr: sum(rate(ai_inference_tokens_total{job="ai-inference"}[5m])) by (model)
          # No gpu="*" selector: "=" is an exact match, so "*" would select nothing
          - record: ai:inference:gpu_utilization
            expr: avg(DCGM_FI_DEV_GPU_UTIL) by (instance, gpu)
          - record: ai:inference:batch_size_avg
            expr: avg(ai_inference_batch_size{job="ai-inference"}) by (model)
          - record: ai:inference:queue_depth
            expr: ai_inference_request_queue_depth{job="ai-inference"}
          - alert: InferenceHighLatency
            expr: ai:inference:latency_p99 > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "AI inference P99 latency above 5s"
              description: "Model {{ $labels.model }} P99 latency is {{ $value }}s"
          - alert: GPUUtilizationLow
            expr: ai:inference:gpu_utilization < 30
            for: 10m
            labels:
              severity: info
            annotations:
              summary: "GPU utilization below 30%"
              description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} utilization is {{ $value }}%"
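5 common pitfalls and their fixes
Pitfall 1: Sidecar startup order causes lost requests
Symptom: the business container starts before the sidecar proxy. Its first inference requests go to localhost while the sidecar is not ready yet, so they fail.
Fix: list the sidecar before the business container and give it a postStart hook that blocks until the proxy is healthy (the kubelet starts containers in order and waits for the hook to return), or make the business container's readinessProbe depend on the sidecar's health check.
spec:
  containers:
    # Listed first so it is started first; postStart holds back the next container.
    - name: sidecar-proxy
      lifecycle:
        postStart:
          exec:
            command: ["/bin/sh", "-c", "until curl -sf http://localhost:15001/health; do sleep 1; done"]
    - name: business-app
      readinessProbe:
        httpGet:
          path: /health
          port: 15001
        initialDelaySeconds: 5
        periodSeconds: 5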
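If changing the pod spec is not an option, the business application can also guard itself against the startup-ordering problem by waiting for the sidecar before sending its first request. The following is a minimal sketch of such a wait loop; the helper names are hypothetical, and the URL reuses the `/health` endpoint on port 15001 shown above.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 30.0,
                     interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts, and URLError all derive from OSError.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_health_probe(url: str, timeout_s: float = 1.0) -> Callable[[], bool]:
    """Build a probe that succeeds when the URL answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    return probe


if __name__ == "__main__":
    ready = wait_until_ready(http_health_probe("http://localhost:15001/health"))
    print("sidecar ready" if ready else "sidecar did not become ready in time")
```

The clock and sleep functions are injectable only to make the loop easy to test; in the application the defaults are enough.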
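Pitfall 2: iptables rules conflict with the GPU driver
Symptom: the sidecar's iptables interception rules disturb NVIDIA GPU driver communication, and model loading fails.
Fix: exclude the ports and IP ranges used for GPU communication so that iptables does not intercept that traffic.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.96.0.0/12"
    traffic.sidecar.istio.io/excludeOutboundPorts: "50051,50052"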
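Pitfall 3: Large model request bodies exceed the Envoy buffer
Symptom: prompts for LLM inference can be very long (tens of KB) and exceed Envoy's default buffer size, so requests get truncated or fail with a 413 error.
Fix: enlarge Envoy's request buffer or switch to streaming. Note that the DestinationRule below only tunes connection handling (HTTP/2 upgrade and connection reuse); a buffer limit itself has to be changed in the proxy configuration, for example through an EnvoyFilter.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-dest
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
    tls:
      mode: ISTIO_MUTUAL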
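Pitfall 4: Resource contention with the sidecar causes latency jitter
Symptom: the sidecar proxy shares a Pod with the model server, and contention for CPU and memory shows up as spikes in inference latency.
Fix: give the sidecar its own resource limits and pin CPUs with the CPU Manager static policy. The static policy only pins containers in Guaranteed pods that request whole CPUs, so requests and limits are set to the same integer value here.
spec:
  runtimeClassName: nvidia
  containers:
    - name: sidecar-proxy
      resources:
        # requests == limits with an integer CPU count, as required for pinning
        requests:
          cpu: "1"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "512Mi"
  # Pod overhead is filled in from the RuntimeClass at admission time and
  # must not be set by hand in the pod spec, so it is not declared here.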
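Pitfall 5: The sidecar connection pool runs out during model hot reload
Symptom: when switching model versions, old connections are not closed and new ones cannot be created, which exhausts the connection pool.
Fix: configure sensible connection timeouts and reclaim idle connections.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-conn-pool
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10s
        idleTimeout: 60s
      http:
        maxRequestsPerConnection: 50
        h2UpgradePolicy: DEFAULT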
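Troubleshooting 10 common errors
| No. | Error message | Cause | Fix |
|---|---|---|---|
| 1 | Sidecar proxy not ready | iptables rules were not injected or the sidecar container did not start | Check that the namespace carries the label istio-injection=enabled and that the sidecar image was pulled successfully |
| 2 | upstream connect error or disconnect/reset before headers | The model server is not ready or the port does not match | Check the model container's health and make sure its port matches the VirtualService |
| 3 | GPU out of memory | Loading the model exceeds GPU memory | Lower gpu_memory_utilization, use a quantized model, or add GPUs |
| 4 | connection refused to 127.0.0.1:15001 | The sidecar is not listening on the expected port | Check the sidecar configuration and confirm the listener port the application calls |
| 5 | request body too large | The request body exceeds the proxy's buffer limit | Raise the proxy's request size limit (max_request_bytes on Envoy's buffer filter) or enable streaming |
| 6 | model not found in registry | The model name does not match the routing configuration | Check the registered model name against the model field in the routing rules |
| 7 | circuit breaker open | Repeated failures of the backend model server tripped the circuit breaker | Check the model server's health and adjust the breaker thresholds |
| 8 | timeout waiting for batch completion | The batch timed out while waiting | Increase maxWaitTimeMs or reduce maxBatchSize |
| 9 | CUDA error: no kernel image is available | The binaries in the image were not built for this GPU's compute capability | Use an image built for the GPU architecture, and check that the NVIDIA driver and the container's CUDA version are compatible |
| 10 | OOMKilled for sidecar container | The sidecar ran out of memory and was killed by K8s | Raise the sidecar's memory limit and check for memory leaks |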
Advanced optimization techniques
1. Adaptive batching window
Adjust the batching window dynamically according to the current load:
class AdaptiveBatcher:
    def __init__(self, minWaitMs=10, maxWaitMs=100, targetBatchSize=16):
        self.minWaitMs = minWaitMs
        self.maxWaitMs = maxWaitMs
        self.targetBatchSize = targetBatchSize
        self.currentWaitMs = minWaitMs
        self._emaArrivalRate = 0.0  # requests per second, smoothed

    def updateWaitTime(self, queueSize: int, intervalMs: float):
        if intervalMs > 0:
            arrivalRate = queueSize / (intervalMs / 1000.0)
            # Exponential moving average keeps the window from oscillating
            self._emaArrivalRate = 0.7 * self._emaArrivalRate + 0.3 * arrivalRate
        if self._emaArrivalRate > 0:
            # Time needed to collect targetBatchSize requests, clamped to [min, max]
            optimalWait = (self.targetBatchSize / self._emaArrivalRate) * 1000
            self.currentWaitMs = max(self.minWaitMs, min(self.maxWaitMs, optimalWait))
        else:
            self.currentWaitMs = self.maxWaitMs
2. Model warm-up and cold-start optimization
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-warmup-config
  namespace: ai-serving
data:
  warmup.yaml: |
    models:
      - name: qwen-chat-v2.5
        warmupRequests:
          - prompt: "Hello, how are you?"
            maxTokens: 32
          - prompt: "Explain quantum computing in one sentence."
            maxTokens: 64
        warmupInterval: 300s
        maxWarmupRetries: 3
3. Inference result caching
Cache the results for identical prompts to avoid recomputing them:
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-cache-config
  namespace: ai-serving
data:
  cache.yaml: |
    enabled: true
    backend: redis
    redis:
      endpoint: redis://redis-cluster:6379
      ttl: 3600
      maxMemory: 2gb
    keyStrategy: prompt_hash
    cacheableModels:
      - qwen-chat-v2.5
      - bge-embedding-v1.5
    hitRateThreshold: 0.3
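The `prompt_hash` key strategy can be pictured as hashing a canonical form of the request. The sketch below is one possible interpretation, with an in-memory dictionary standing in for Redis; the key prefix, the fields included in the hash, and the class name are all assumptions for illustration.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def cache_key(model: str, payload: Dict[str, Any]) -> str:
    """Hash the model plus the fields that determine the output."""
    canonical = json.dumps(
        {
            "model": model,
            "messages": payload.get("messages"),
            "prompt": payload.get("prompt"),
            "temperature": payload.get("temperature"),
            "max_tokens": payload.get("max_tokens"),
        },
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return "infer:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TtlCache:
    """Tiny in-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (value, self._clock() + self._ttl_s)
```

Sampling parameters are part of the key because the same prompt can legitimately produce different outputs under different settings; caching tends to be most useful for embeddings and for requests with deterministic sampling.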
4. Request priority and preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 1000000
globalDefault: false
description: "Real-time inference requests with strict latency SLA"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 10000
globalDefault: false
description: "Batch inference with no latency SLA"
preemptionPolicy: Never
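Comparison: sidecar proxy vs Service Mesh vs Gateway
| Dimension | Sidecar proxy | Service Mesh (Istio) | API Gateway |
|---|---|---|---|
| Deployment location | Inside the Pod, next to the business container | Inside the Pod, covering the whole mesh | At the cluster edge, deployed separately |
| Traffic interception | iptables/eBPF | iptables/ztunnel | DNS/virtual IP |
| Model routing | Custom logic, flexible | VirtualService, declarative | Route rules, limited |
| Batching | Native support, deeply customizable | Not supported | Not supported |
| GPU awareness | Can be aware of GPU resources | Not aware | Not aware |
| Performance overhead | Low (5-15ms) | Medium (10-30ms) | Low (2-5ms) |
| Observability | Custom metrics | Mesh-wide metrics | Ingress metrics |
| Operational complexity | Medium | High | Low |
| Best fit | Dedicated AI inference proxy | Service-to-service traffic across the cluster | Entry point for external traffic |
| Learning curve | Medium | High | Low |
Recommended strategy: for AI inference, use a dedicated sidecar proxy for model routing and batching, the Service Mesh for service-to-service communication, and the API Gateway for external ingress. Each layer does its own job without interfering with the others.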
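Recommended online tools
- YAML/JSON formatter: /zh-CN/json/format (format K8s YAML configuration)
- Base64 encoder/decoder: /zh-CN/encode/base64 (handle certificates and keys in Secrets)
- curl to code: /zh-CN/dev/curl-to-code (quickly generate API test code)
Related reading
- K8s Gateway API service mesh traffic management migration guide: a closer look at using the Gateway API in a service mesh
- Python AI model production deployment in practice: a complete path from development to production for Python AI models
- K8s HPA autoscaling in production: autoscaling strategies for AI inference services
Summary
By 2026 the K8s sidecar AI inference proxy has become a standard architecture for deploying AI inference. The 7 production patterns cover the full path from traffic interception to observability: Envoy traffic interception requires no changes to business code, smart model routing picks a model from request features, A/B testing and canary releases make model rollouts safer, multi-model serving lets versions coexist, GPU resource pooling raises utilization from 30% to 80% or more, batching raises throughput 5-8x, and OpenTelemetry end-to-end tracing exposes performance bottlenecks. The core principle: the sidecar proxy focuses on inference logic and the business container on business logic, and the two talk over localhost with no coupling and no intrusion.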