K8s Sidecar AI Inference Proxy: 7 Production Patterns from Traffic Interception to Model Routing

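Is your AI inference service still running with nothing in front of it? Do you change business code for every model upgrade? Does traffic management turn into a mess when several model versions coexist? Is GPU utilization below 30% while you keep scaling out? In 2026, the Kubernetes sidecar pattern has become a standard architecture for AI inference proxies, and it is time to use sidecar containers to decouple inference logic from business logic.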

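Key takeaways

Understand the architecture and use cases of a K8s sidecar AI inference proxy
Master complete implementations of 7 production-grade sidecar proxy patterns
Learn the YAML configuration for traffic interception, model routing, and A/B testing
Explore advanced optimizations such as GPU resource pooling and request batching
Avoid 5 common pitfalls and 10 frequent errors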

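Sidecar AI proxy architecture overview
Pattern 1: Envoy traffic interception and rewriting
Pattern 2: Smart model routing
Pattern 3: A/B testing and canary releases
Pattern 4: Multi-model serving and version management
Pattern 5: GPU resource pooling
Pattern 6: Batching and request merging
Pattern 7: Observability and distributed tracing
5 common pitfalls and their fixes
Troubleshooting 10 common errors
Advanced optimization techniques
Comparison: sidecar proxy vs service mesh vs gateway
Recommended online tools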

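Sidecar AI proxy architecture overview

Why does AI inference need a sidecar proxy?

Traditional AI inference deployments couple model loading, inference execution, and traffic management inside the business container. Whenever a model version changes, a routing policy is updated, or a resource limit is adjusted, the whole service has to be rebuilt and redeployed. The sidecar proxy pattern separates these concerns: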

┌─────────────────────────────────────────────────┐
│                   Pod                            │
│                                                  │
│  ┌──────────────┐      ┌──────────────────────┐ │
│  │  Business    │      │  Sidecar Proxy       │ │
│  │  Container   │─────▶│  (AI Inference)      │ │
│  │              │      │                      │ │
│  │  - API logic │      │  - Traffic intercept │ │
│  │  - Processing│      │  - Model routing     │ │
│  │  - Aggregate │      │  - Load balancing    │ │
│  │              │      │  - Request batching  │ │
│  │  :8080       │      │  - Metrics collection│ │
│  └──────────────┘      │                      │ │
│         │              │  :15001(outbound)    │ │
│         │              │  :15006(inbound)     │ │
│         ▼              └──────────────────────┘ │
│  ┌──────────────┐               │               │
│  │  Model       │◀──────────────┘               │
│  │  Server      │                               │
│  │  (vLLM/Triton│                               │
│  │   /Ollama)   │                               │
│  │  :8000       │                               │
│  └──────────────┘                               │
└─────────────────────────────────────────────────┘

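Core responsibilities of the sidecar proxy

Responsibility	Description	Benefit
Traffic interception	Intercepts inference requests from the business container	No changes to business code
Model routing	Routes to different models based on request features	Multiple model versions coexist
Load balancing	Distributes inference requests across replicas	Higher throughput
Batching	Merges multiple requests into one batched inference call	3-5x higher GPU utilization (workload-dependent)
Circuit breaking and fallback	Degrades quickly when the model service misbehaves	Protects system stability
Observability	Collects inference latency, throughput, and related metrics	End-to-end visibility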

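Pattern 1: Envoy traffic interception and rewriting

How it works

Envoy runs as the sidecar proxy. iptables rules, installed by Istio's init container or CNI plugin, redirect the business container's outbound traffic to Envoy, which rewrites inference requests and forwards them to the target model service. This is the classic way to intercept traffic with a K8s sidecar.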

Client Request
     │
     ▼
┌─────────┐    iptables    ┌──────────────┐
│ Business │───redirect───▶│    Envoy     │
│ Container│    :8080      │   Sidecar    │
│          │               │   :15001     │
└─────────┘               └──────┬───────┘
                                  │
                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
              ┌─────────┐  ┌─────────┐  ┌─────────┐
              │ Model A │  │ Model B │  │ Model C │
              │ :8000   │  │ :8001   │  │ :8002   │
              └─────────┘  └─────────┘  └─────────┘

Full configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.0/8"
        traffic.sidecar.istio.io/excludeInboundPorts: "9090"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            # Call the service hostname; the injected Envoy intercepts this outbound call transparently
            - name: INFERENCE_ENDPOINT
              value: "http://inference-internal/v1/completions"
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
        - name: model-server
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
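          # The vLLM OpenAI server image takes its settings as CLI arguments.
          # A 72B model in 16-bit precision needs more memory than one 80 GB GPU;
          # raise the GPU count and add --tensor-parallel-size, or use a quantized variant.
          args:
            - "--model"
            - "Qwen/Qwen2.5-72B-Instruct"
            - "--gpu-memory-utilization"
            - "0.9"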
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"

Envoy rewrite rules

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-rewrite
  namespace: ai-serving
spec:
  hosts:
    - inference-internal
  http:
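    - match:
        # uri and headers inside one match block must both hold (AND);
        # listing them as separate items would OR them
        - uri:
            prefix: "/v1/completions"
          headers:
            x-model-type:
              exact: "chat"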
      rewrite:
        uri: "/v1/chat/completions"
      route:
        - destination:
            host: model-server
            port:
              number: 8000
    - match:
        - uri:
            prefix: "/v1/embeddings"
      route:
        - destination:
            host: embedding-server
            port:
              number: 8001
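
In Istio, the conditions inside a single match block are ANDed, while separate match blocks on the same route are ORed, and the first matching route wins. The following is a minimal Python sketch of that evaluation order, not Istio's implementation. The `Match`, `Rule`, and `route` names are hypothetical, and header names are compared case-sensitively for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Match:
    """One match block: every condition set here must hold (AND)."""
    uri_prefix: Optional[str] = None
    headers: Dict[str, str] = field(default_factory=dict)  # exact header matches

    def matches(self, path: str, headers: Dict[str, str]) -> bool:
        if self.uri_prefix is not None and not path.startswith(self.uri_prefix):
            return False
        return all(headers.get(k) == v for k, v in self.headers.items())


@dataclass
class Rule:
    """One http route: it applies if ANY of its match blocks matches (OR)."""
    match_blocks: List[Match]
    destination: str
    rewrite_prefix: Optional[str] = None


def route(rules: List[Rule], path: str, headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (destination, rewritten_path) for the first matching rule, else None."""
    for rule in rules:
        for block in rule.match_blocks:
            if block.matches(path, headers):
                new_path = path
                if rule.rewrite_prefix is not None and block.uri_prefix is not None:
                    # Prefix rewrite: only the matched prefix is replaced.
                    new_path = rule.rewrite_prefix + path[len(block.uri_prefix):]
                return rule.destination, new_path
    return None


# Rules modelled on the VirtualService above (hosts and ports as plain strings).
RULES = [
    Rule(
        match_blocks=[Match(uri_prefix="/v1/completions", headers={"x-model-type": "chat"})],
        destination="model-server:8000",
        rewrite_prefix="/v1/chat/completions",
    ),
    Rule(match_blocks=[Match(uri_prefix="/v1/embeddings")], destination="embedding-server:8001"),
]

print(route(RULES, "/v1/completions", {"x-model-type": "chat"}))
# ('model-server:8000', '/v1/chat/completions')
print(route(RULES, "/v1/embeddings", {}))
# ('embedding-server:8001', '/v1/embeddings')
print(route(RULES, "/v1/completions", {}))
# None
```

A completions request without the header falls through both rules here, which is a reminder to add a default route if such requests should still be served.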

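Pattern 2: Smart model routing

Dynamic routing based on request features

In AI inference, different requests may need models of different sizes. The sidecar proxy can route on request headers, payload content, token count, and similar features.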

┌──────────────────────────────────────────────────────┐
│                 Smart Model Router                    │
│                                                      │
│  Request ──▶ [Token Counter] ──▶ [Model Selector]   │
│                  │                    │               │
│                  │     ┌──────────────┼──────────┐   │
│                  │     ▼              ▼          ▼   │
│                  │  ┌───────┐  ┌─────────┐ ┌──────┐ │
│                  │  │Small  │  │Medium   │ │Large │ │
│                  │  │Model  │  │Model    │ │Model │ │
│                  │  │<1K tok│  │1K-8K tok│ │>8K   │ │
│                  │  │Qwen2.5│  │Qwen2.5  │ │Qwen2.│ │
│                  │  │-7B    │  │-32B     │ │5-72B │ │
│                  │  └───────┘  └─────────┘ └──────┘ │
└──────────────────────────────────────────────────────┘

Routing configuration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: smart-model-route
  namespace: ai-serving
spec:
  hosts:
    - ai-router
  http:
    - match:
        - headers:
            x-token-range:
              exact: "small"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-7b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "medium"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "large"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-72b-service
            port:
              number: 8000
          weight: 100
    - route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
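
The VirtualService above routes on an `x-token-range` header, so some component in front of it has to compute that header. Below is a minimal sketch of that step, assuming the same rough heuristic of about four characters per token and the 1000 and 8000 thresholds used by the Go router in this section. The function names are hypothetical.

```python
from typing import Dict, List

# Thresholds match the Go router's SmallTokenThreshold and LargeTokenThreshold.
SMALL_TOKEN_THRESHOLD = 1000
LARGE_TOKEN_THRESHOLD = 8000


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Rough estimate: about four characters per token (an assumption, not a tokenizer)."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4


def token_range(token_count: int) -> str:
    """Map a token estimate to the header value the VirtualService matches on."""
    if token_count <= SMALL_TOKEN_THRESHOLD:
        return "small"
    if token_count <= LARGE_TOKEN_THRESHOLD:
        return "medium"
    return "large"


def routing_headers(messages: List[Dict[str, str]]) -> Dict[str, str]:
    """Headers to attach to the outgoing inference request."""
    return {"x-token-range": token_range(estimate_tokens(messages))}


print(routing_headers([{"role": "user", "content": "Summarize this paragraph."}]))
# {'x-token-range': 'small'}
```

Requests that arrive without the header still reach the 32B service through the default route at the end of the VirtualService.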

Go routing proxy implementation

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.uber.org/zap"
)

type ModelRouteConfig struct {
	SmallModelEndpoint  string `json:"smallModelEndpoint"`
	MediumModelEndpoint string `json:"mediumModelEndpoint"`
	LargeModelEndpoint  string `json:"largeModelEndpoint"`
	SmallTokenThreshold int    `json:"smallTokenThreshold"`
	LargeTokenThreshold int    `json:"largeTokenThreshold"`
}

type InferenceRequest struct {
	Model     string    `json:"model"`
	Messages  []Message `json:"messages,omitempty"`
	Prompt    string    `json:"prompt,omitempty"`
	MaxTokens int       `json:"max_tokens,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type SmartRouter struct {
	config  *ModelRouteConfig
	logger  *zap.Logger
	metrics *RouterMetrics
}

type RouterMetrics struct {
	routeDecision  *prometheus.CounterVec
	requestLatency *prometheus.HistogramVec
}

func NewSmartRouter(cfg *ModelRouteConfig, logger *zap.Logger) *SmartRouter {
	metrics := &RouterMetrics{
		routeDecision: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "ai_router_route_decision_total",
				Help: "Model route decision counter",
			},
			[]string{"model_size", "model_name"},
		),
		requestLatency: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "ai_router_request_latency_seconds",
				Help:    "Request routing latency",
				Buckets: prometheus.DefBuckets,
			},
			[]string{"model_size"},
		),
	}
	prometheus.MustRegister(metrics.routeDecision, metrics.requestLatency)
	return &SmartRouter{config: cfg, logger: logger, metrics: metrics}
}

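// estimateTokenCount is a rough heuristic: about four bytes of input per token.
// len() counts bytes rather than characters, so the estimate drifts for
// multi-byte text such as CJK; use a real tokenizer where accuracy matters.
func (r *SmartRouter) estimateTokenCount(req *InferenceRequest) int {
	totalChars := 0
	if req.Prompt != "" {
		totalChars += len(req.Prompt)
	}
	for _, msg := range req.Messages {
		totalChars += len(msg.Content)
	}
	return totalChars / 4
}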

func (r *SmartRouter) selectModel(tokenCount int) (string, string) {
	if tokenCount <= r.config.SmallTokenThreshold {
		return "small", r.config.SmallModelEndpoint
	}
	if tokenCount <= r.config.LargeTokenThreshold {
		return "medium", r.config.MediumModelEndpoint
	}
	return "large", r.config.LargeModelEndpoint
}

func (r *SmartRouter) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	body, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusBadRequest)
		return
	}
	defer req.Body.Close()

	var inferReq InferenceRequest
	if err := json.Unmarshal(body, &inferReq); err != nil {
		http.Error(w, "invalid request format", http.StatusBadRequest)
		return
	}

	tokenCount := r.estimateTokenCount(&inferReq)
	modelSize, endpoint := r.selectModel(tokenCount)
	r.logger.Info("routing decision", zap.Int("token_count", tokenCount), zap.String("model_size", modelSize))
	r.metrics.routeDecision.WithLabelValues(modelSize, inferReq.Model).Inc()

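	target, err := url.Parse(endpoint)
	if err != nil {
		r.logger.Error("invalid upstream endpoint", zap.String("endpoint", endpoint), zap.Error(err))
		http.Error(w, "invalid upstream endpoint", http.StatusInternalServerError)
		return
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	// Restore the body consumed above so the reverse proxy can forward it.
	req.Body = io.NopCloser(strings.NewReader(string(body)))
	req.ContentLength = int64(len(body))
	proxy.ServeHTTP(w, req)
	// Observed after the upstream call returns, so this includes model inference time.
	r.metrics.requestLatency.WithLabelValues(modelSize).Observe(time.Since(start).Seconds())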
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	cfg := &ModelRouteConfig{
		SmallModelEndpoint:  "http://qwen-7b-service:8000",
		MediumModelEndpoint: "http://qwen-32b-service:8000",
		LargeModelEndpoint:  "http://qwen-72b-service:8000",
		SmallTokenThreshold: 1000,
		LargeTokenThreshold: 8000,
	}
	router := NewSmartRouter(cfg, logger)
	mux := http.NewServeMux()
	mux.Handle("/v1/", router)
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "ok") })
	server := &http.Server{Addr: ":15001", Handler: mux, WriteTimeout: 120 * time.Second}
	logger.Info("smart router starting", zap.String("addr", server.Addr))
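	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		logger.Fatal("server exited", zap.Error(err))
	}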
}

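Pattern 3: A/B testing and canary releases

Weight-based canary release

A canary release is an essential step when rolling out a new AI model. The sidecar proxy can split traffic by weight and shift it gradually from the old model to the new one.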

┌────────────────────────────────────────────┐
│           Canary Deployment Flow            │
│                                            │
│  Traffic ──▶ [Sidecar Proxy] ──┬── 90% ──▶│──▶ Model v1 (Stable)
│                                │           │
│                                └── 10% ──▶│──▶ Model v2 (Canary)
│                                            │
│  Metrics:                                   │
│  ┌──────────────────────────────────────┐  │
│  │ v1: latency_p99=120ms  error=0.1%    │  │
│  │ v2: latency_p99=95ms   error=0.05%   │  │
│  └──────────────────────────────────────┘  │
│                                            │
│  Decision: Promote v2 ──▶ Shift to 50/50  │
└────────────────────────────────────────────┘
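
To make the 90/10 split concrete, here is a minimal Python sketch of weight-based backend selection. It is a sticky variant, assumed for illustration: the same request key (for example a user ID) always lands on the same backend, whereas a mesh's weighted routing typically picks per request. `pick_backend` and the key format are hypothetical names.

```python
import hashlib
from typing import Dict


def pick_backend(request_key: str, weights: Dict[str, int]) -> str:
    """Map a request key to a backend so that traffic splits according to weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the key into a stable bucket in [0, total).
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for backend, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return backend
    raise AssertionError("unreachable: bucket is always below the total weight")


weights = {"model-v1": 90, "model-v2": 10}
picks = [pick_backend(f"user-{i}", weights) for i in range(10_000)]
print("canary share:", picks.count("model-v2") / len(picks))
```

Sticky assignment keeps one user's experience consistent during the rollout, at the cost of a split that only approaches the configured ratio over many distinct keys.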

Canary release configuration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-canary
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    - name: primary  # route name referenced by the Argo Rollouts trafficRouting config
      route:
        - destination:
            host: model-v1
            port:
              number: 8000
          weight: 90
        - destination:
            host: model-v2
            port:
              number: 8000
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 30s
        retryOn: 5xx,reset

The second VirtualService below does header-based A/B testing on the x-experiment header instead of weights. Apply only one of the two to a given host: Istio merges VirtualServices for the same host only when they are bound to a gateway, not in sidecars, so two sidecar-bound resources for model-service would not combine as intended.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-ab-test
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-creative"
      route:
        - destination:
            host: model-v2-creative
            port:
              number: 8000
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-precise"
      route:
        - destination:
            host: model-v2-precise
            port:
              number: 8000
    - route:
        - destination:
            host: model-v1
            port:
              number: 8000

Automated canary with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-rollout
  namespace: ai-serving
spec:
  replicas: 4
  # selector and template (or workloadRef) are omitted here; a real Rollout needs them
  strategy:
    canary:
      canaryService: model-v2
      stableService: model-v1
      trafficRouting:
        istio:
          virtualServices:
            - name: model-canary
              routes:
                - primary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 75
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: model-quality-check
        startingStep: 2
        args:
          - name: canary-service
            value: model-v2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
  namespace: ai-serving
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          query: |
            sum(rate(http_requests_total{service="{{args.canary-service}}",code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.canary-service}}"}[1m]))
    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: result[0] <= 500
      provider:
        prometheus:
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.canary-service}}"}[1m]))
              by (le)
            ) * 1000
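
The two success conditions above can be read as a simple gate. The sketch below restates them in Python under stated assumptions: it combines both metrics into one check (Argo Rollouts evaluates each metric separately), and it assumes the default behaviour that a single failed measurement fails the analysis. `Measurement` and `analysis_passes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MAX_ERROR_RATE = 0.01     # from successCondition: result[0] <= 0.01
MAX_LATENCY_P99_MS = 500  # from successCondition: result[0] <= 500


@dataclass
class Measurement:
    error_rate: float      # fraction of 5xx responses, 0.0 to 1.0
    latency_p99_ms: float  # P99 latency in milliseconds


def measurement_ok(m: Measurement) -> bool:
    """A measurement passes only if both thresholds hold."""
    return m.error_rate <= MAX_ERROR_RATE and m.latency_p99_ms <= MAX_LATENCY_P99_MS


def analysis_passes(measurements: List[Measurement], failure_limit: int = 0) -> bool:
    """Pass if the number of failed measurements stays within failure_limit."""
    failures = sum(1 for m in measurements if not measurement_ok(m))
    return failures <= failure_limit


healthy_run = [Measurement(error_rate=0.0005, latency_p99_ms=95.0)] * 10
print(analysis_passes(healthy_run))
# True
print(analysis_passes(healthy_run[:9] + [Measurement(0.02, 95.0)]))
# False
```

With `interval: 30s` and `count: 10`, each metric is sampled ten times over roughly five minutes before the rollout proceeds past the analysed steps.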

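Pattern 4: Multi-model serving and version management

Architecture for coexisting model versions

In production, different business lines may depend on different model versions. The sidecar proxy can manage several model versions at once, so versions coexist and migrations stay smooth.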

┌─────────────────────────────────────────────────────┐
│              Multi-Model Serving Pod                 │
│                                                     │
│  ┌──────────┐     ┌────────────────────────────┐   │
│  │ Business  │     │  Model Router Sidecar      │   │
│  │ App       │────▶│                            │   │
│  │           │     │  /v1/chat ──▶ v2.5 model   │   │
│  │           │     │  /v1/embed ──▶ embed model │   │
│  │           │     │  /v1/rerank─▶ rerank model │   │
│  └──────────┘     └────────────┬───────────────┘   │
│                                  │                  │
│              ┌───────────────────┼───────────────┐  │
│              ▼                   ▼               ▼  │
│        ┌──────────┐      ┌──────────┐     ┌────────┐│
│        │ vLLM     │      │ TEI      │     │ TEI    ││
│        │ Chat     │      │ Embed    │     │ Rerank ││
│        │ :8000    │      │ :8080    │     │ :8081  ││
│        └──────────┘      └──────────┘     └────────┘│
└─────────────────────────────────────────────────────┘

Multi-model Deployment configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-serving
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-model
  template:
    metadata:
      labels:
        app: multi-model
    spec:
      containers:
        - name: chat-model
          image: vllm/vllm-openai:v0.6.0
          ports:
            - containerPort: 8000
          # The vLLM OpenAI server image takes its settings as CLI arguments
          args:
            - "--model"
            - "Qwen/Qwen2.5-32B-Instruct"
            - "--port"
            - "8000"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
        - name: embedding-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-large-zh-v1.5"
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
        - name: rerank-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8081
          env:
            - name: MODEL_ID
              value: "BAAI/bge-reranker-v2-m3"
            - name: PORT
              value: "8081"
            - name: RERANK
              value: "true"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

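Model version management CRD

ModelVersion is a custom resource. It only takes effect if a matching CRD and a controller that understands these fields are installed in the cluster.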

apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-5
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.5"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.5-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 80
    canary: false
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
---
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-6-canary
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.6"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.6-32B-Instruct  # illustrative ID for the newer version; substitute a published model ID
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 20
    canary: true
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
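
How a router might combine the `routing.weight` and `healthCheck` fields is sketched below. This is an assumption about the semantics of the custom resource, not a documented behaviour: `unhealthyThreshold` is taken to mean consecutive failed probes, and the weight of an unhealthy version is redistributed to the healthy ones. All names are hypothetical.

```python
from typing import Dict


class HealthTracker:
    """Marks a backend unhealthy after N consecutive failed probes."""

    def __init__(self, unhealthy_threshold: int = 3):
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1
        return self.healthy

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.unhealthy_threshold


def effective_weights(weights: Dict[str, int], healthy: Dict[str, bool]) -> Dict[str, float]:
    """Drop unhealthy versions and renormalise the remaining weights to fractions."""
    alive = {name: w for name, w in weights.items() if healthy.get(name, False)}
    total = sum(alive.values())
    if total == 0:
        return {}  # nothing healthy: the caller decides how to fail
    return {name: w / total for name, w in alive.items()}


canary = HealthTracker(unhealthy_threshold=3)
for ok in (False, False, False):
    canary.record(ok)

print(effective_weights({"2.5": 80, "2.6": 20}, {"2.5": True, "2.6": canary.healthy}))
# {'2.5': 1.0}
```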

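Pattern 5: GPU resource pooling

GPU sharing and time-slicing

GPUs are the most expensive resource in AI inference. With GPU resource pooling, multiple inference services share one GPU, and time-slicing raises its utilization. The sharing itself has to be provided at the device level, for example by a GPU device plugin that supports time-slicing; the sidecar's role is to decide which requests go to which shared GPU.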

┌─────────────────────────────────────────────────────┐
│               GPU Resource Pooling                   │
│                                                     │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │ Pod A   │  │ Pod B   │  │ Pod C   │            │
│  │ Sidecar │  │ Sidecar │  │ Sidecar │            │
│  └────┬────┘  └────┬────┘  └────┬────┘            │
│       │            │            │                  │
│       ▼            ▼            ▼                  │
│  ┌──────────────────────────────────────────┐      │
│  │         GPU Scheduler Sidecar            │      │
│  │                                          │      │
│  │  Time-Slicing:                           │      │
│  │  GPU 0: [A][A][B][B][C][A][A][B][B][C]  │      │
│  │  GPU 1: [C][C][A][B][C][C][A][B][C][C]  │      │
│  │                                          │      │
│  │  Memory Partitioning:                    │      │
│  │  GPU 0: 40% A | 35% B | 25% C           │      │
│  │  GPU 1: 30% A | 40% B | 30% C           │      │
│  └──────────────────────────────────────────┘      │
│       │            │                               │
│       ▼            ▼                               │
│  ┌──────────┐ ┌──────────┐                        │
│  │  GPU 0   │ │  GPU 1   │                        │
│  │ A100 80G │ │ A100 80G │                        │
│  └──────────┘ └──────────┘                        │
└─────────────────────────────────────────────────────┘

GPU time-slicing configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: ai-serving
data:
  scheduler.yaml: |
    scheduling:
      strategy: time-slicing
      gpuGroups:
        - name: inference-pool
          gpuIds: [0, 1, 2, 3]
          timeSliceInterval: 100ms
          maxSharesPerGpu: 4
          memoryLimitPerShare: 20Gi
        - name: embedding-pool
          gpuIds: [4, 5]
          timeSliceInterval: 50ms
          maxSharesPerGpu: 8
          memoryLimitPerShare: 10Gi
    policies:
      - modelType: chat
        gpuGroup: inference-pool
        minShares: 1
        maxShares: 2
        priority: high
      - modelType: embedding
        gpuGroup: embedding-pool
        minShares: 1
        maxShares: 4
        priority: medium
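
A quick arithmetic check on the pools above: `maxSharesPerGpu` times `memoryLimitPerShare` should not exceed what one GPU can actually offer. The sketch below performs that check; `check_group` is a hypothetical helper, and the usable-memory figures passed in are assumptions, since usable memory is lower than the nominal card size once the driver and runtime take their share.

```python
from typing import Dict, Union


def parse_gi(value: str) -> int:
    """Parse a quantity such as '20Gi' into whole GiB (only the Gi suffix is handled)."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {value}")
    return int(value[:-2])


def check_group(name: str, max_shares_per_gpu: int, memory_limit_per_share: str,
                gpu_memory_gi: int) -> Dict[str, Union[str, int, bool]]:
    """Compare the worst-case memory demand of one GPU with its capacity."""
    needed = max_shares_per_gpu * parse_gi(memory_limit_per_share)
    return {"group": name, "needed_gi": needed, "fits": needed <= gpu_memory_gi}


# Nominal capacity: both pools fill an 80 GiB budget exactly, with no headroom.
print(check_group("inference-pool", 4, "20Gi", gpu_memory_gi=80))
# {'group': 'inference-pool', 'needed_gi': 80, 'fits': True}
print(check_group("embedding-pool", 8, "10Gi", gpu_memory_gi=80))
# {'group': 'embedding-pool', 'needed_gi': 80, 'fits': True}

# With a lower usable figure (76 is an illustrative value), the same config overcommits.
print(check_group("inference-pool", 4, "20Gi", gpu_memory_gi=76))
# {'group': 'inference-pool', 'needed_gi': 80, 'fits': False}
```

Leaving some headroom per GPU, either fewer shares or a smaller per-share limit, avoids out-of-memory failures when every share is busy at once.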

GPU resource quota management

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    # ResourceQuota accepts only the "requests." prefix for extended resources
    requests.nvidia.com/gpu: "8"
    # nvidia.com/gpu-share is assumed to be exposed by a GPU-sharing device plugin
    requests.nvidia.com/gpu-share: "32"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100000
globalDefault: false
description: "Low priority for batch inference jobs"

Python GPU scheduler

import asyncio
import time
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

logger = logging.getLogger(__name__)

class TaskPriority(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3

@dataclass
class InferenceTask:
    taskId: str
    modelId: str
    gpuMemoryRequired: int
    priority: TaskPriority
    maxLatencyMs: int
    submittedAt: float = field(default_factory=time.time)
    assignedGpu: Optional[int] = None

@dataclass
class GpuSlot:
    gpuId: int
    totalMemory: int
    usedMemory: int = 0
    currentModel: Optional[str] = None
    lastUsed: float = field(default_factory=time.time)

    @property
    def availableMemory(self) -> int:
        return self.totalMemory - self.usedMemory

    @property
    def utilization(self) -> float:
        return self.usedMemory / self.totalMemory if self.totalMemory > 0 else 0.0

class GpuScheduler:
    def __init__(self, gpuSlots: List[GpuSlot], maxQueueSize: int = 1000):
        self.gpuSlots = {slot.gpuId: slot for slot in gpuSlots}
        self.taskQueue: List[InferenceTask] = []
        self.maxQueueSize = maxQueueSize
        self._lock = asyncio.Lock()
        self._stats = {"totalScheduled": 0, "totalRejected": 0, "totalEvicted": 0}

    async def submitTask(self, task: InferenceTask) -> Optional[int]:
        async with self._lock:
            if len(self.taskQueue) >= self.maxQueueSize:
                self._stats["totalRejected"] += 1
                return None
            gpuId = self._findBestGpu(task)
            if gpuId is not None:
                slot = self.gpuSlots[gpuId]
                slot.usedMemory += task.gpuMemoryRequired
                slot.currentModel = task.modelId
                slot.lastUsed = time.time()
                task.assignedGpu = gpuId
                self._stats["totalScheduled"] += 1
                return gpuId
            self.taskQueue.append(task)
            self.taskQueue.sort(key=lambda t: t.priority.value)
            return None

    def _findBestGpu(self, task: InferenceTask) -> Optional[int]:
        candidates = [(gid, s) for gid, s in self.gpuSlots.items() if s.availableMemory >= task.gpuMemoryRequired]
        if not candidates:
            return self._tryEvict(task)
        candidates.sort(key=lambda x: x[1].utilization)
        return candidates[0][0]

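    def _tryEvict(self, task: InferenceTask) -> Optional[int]:
        # Only high-priority tasks may evict; a slot counts as idle after 300s without use.
        if task.priority != TaskPriority.HIGH:
            return None
        for gpuId, slot in self.gpuSlots.items():
            idle = slot.currentModel and slot.lastUsed < time.time() - 300
            # Skip GPUs that could not hold the task even when empty.
            if idle and slot.totalMemory >= task.gpuMemoryRequired:
                slot.usedMemory = 0
                slot.currentModel = None
                self._stats["totalEvicted"] += 1
                return gpuId
        return None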

    async def releaseGpu(self, gpuId: int, memoryFreed: int):
        async with self._lock:
            slot = self.gpuSlots.get(gpuId)
            if slot:
                slot.usedMemory = max(0, slot.usedMemory - memoryFreed)
                if slot.usedMemory == 0:
                    slot.currentModel = None
                # Hand the freed capacity to the highest-priority queued task, if it fits.
                if self.taskQueue:
                    nextTask = self.taskQueue[0]
                    if slot.availableMemory >= nextTask.gpuMemoryRequired:
                        self.taskQueue.pop(0)
                        slot.usedMemory += nextTask.gpuMemoryRequired
                        slot.currentModel = nextTask.modelId
                        slot.lastUsed = time.time()
                        nextTask.assignedGpu = gpuId
                        self._stats["totalScheduled"] += 1

    def getStats(self) -> Dict:
        return {
            **self._stats,
            "queueSize": len(self.taskQueue),
            "gpuUtilization": {
                gid: {"utilization": f"{s.utilization:.1%}", "availableMemory": f"{s.availableMemory}MB", "currentModel": s.currentModel}
                for gid, s in self.gpuSlots.items()
            },
        }

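Pattern 6: Batching and request merging

Dynamic batching architecture

GPU utilization for LLM inference is often low (10-30%) when each request is processed on its own. The sidecar proxy can collect requests that arrive within a short time window and merge them into one batched inference call, which raises throughput substantially. Servers with continuous batching, such as vLLM, already batch internally, so proxy-side batching tends to matter most for backends that lack it.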

┌─────────────────────────────────────────────────────┐
│           Dynamic Batching Sidecar                   │
│                                                     │
│  Request 1 ──▶ ┐                                    │
│  Request 2 ──▶ │  ┌──────────────────────────┐     │
│  Request 3 ──▶ ├─▶│  Batch Window (50ms)     │     │
│  Request 4 ──▶ │  │                          │     │
│  Request 5 ──▶ ┘  │  Collect → Merge → Send  │     │
│                   │                          │     │
│                   │  Batch Size: 4-32        │     │
│                   │  Max Wait: 50ms          │     │
│                   └──────────┬───────────────┘     │
│                              │                     │
│                              ▼                     │
│                   ┌──────────────────────┐         │
│                   │  Model Server        │         │
│                   │  (vLLM with          │         │
│                   │   continuous         │         │
│                   │   batching)          │         │
│                   └──────────────────────┘         │
│                                                     │
│  Throughput: 1x → 5-8x                             │
│  Latency overhead: +5-15ms                         │
└─────────────────────────────────────────────────────┘

Batching proxy configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: batch-proxy-config
  namespace: ai-serving
data:
  proxy.yaml: |
    server:
      port: 15001
      readTimeout: 30s
      writeTimeout: 120s
    batching:
      enabled: true
      maxBatchSize: 32
      maxWaitTimeMs: 50
      maxRequestTokens: 8192
      strategy: dynamic
    routing:
      defaultEndpoint: http://localhost:8000
      endpoints:
        - path: /v1/chat/completions
          model: chat
          batchEnabled: true
        - path: /v1/embeddings
          model: embedding
          batchEnabled: true
          maxBatchSize: 64
    circuitBreaker:
      enabled: true
      failureThreshold: 5
      recoveryTimeout: 30s
      halfOpenRequests: 3
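
The `circuitBreaker` block above implies a three-state machine: closed, open, and half-open. Below is a minimal Python sketch of one way to interpret it. The semantics are assumptions: `failureThreshold` counts consecutive failures, and `halfOpenRequests` is the number of trial requests that must all succeed before the breaker closes again. The class and its injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 half_open_requests: int = 3, clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self.clock = clock
        self.state = "closed"
        self.failures = 0          # consecutive failures while closed
        self.opened_at = 0.0
        self.trial_count = 0       # trial requests admitted while half-open
        self.trial_successes = 0

    def allow_request(self) -> bool:
        """Decide whether a request may be sent to the model server."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False
            # Recovery timeout elapsed: start admitting a few trial requests.
            self.state = "half_open"
            self.trial_count = 0
            self.trial_successes = 0
        if self.state == "half_open":
            if self.trial_count >= self.half_open_requests:
                return False
            self.trial_count += 1
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_requests:
                self.state = "closed"
                self.failures = 0
        elif self.state == "closed":
            self.failures = 0

    def record_failure(self) -> None:
        if self.state == "open":
            return  # late response from before the breaker opened
        if self.state == "half_open":
            self._open()  # any failed trial reopens the breaker
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

A proxy would call `allow_request()` before forwarding and return a fallback response when it is False, then report the outcome with `record_success()` or `record_failure()`.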

Python batching proxy

import asyncio
import time
import uuid
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Any, Optional
from collections import defaultdict

logger = logging.getLogger(__name__)

@dataclass
class BatchRequest:
    requestId: str
    payload: Dict[str, Any]
    future: asyncio.Future
    submittedAt: float = field(default_factory=time.time)

@dataclass
class BatchConfig:
    maxBatchSize: int = 32
    maxWaitTimeMs: int = 50
    maxRequestTokens: int = 8192

class DynamicBatcher:
    def __init__(self, config: BatchConfig, inferenceFn):
        self.config = config
        self.inferenceFn = inferenceFn
        self.pendingRequests: Dict[str, List[BatchRequest]] = defaultdict(list)
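        self._running = False
        self._loopTask: Optional[asyncio.Task] = None
        self._stats = {"totalBatches": 0, "totalRequests": 0, "avgBatchSize": 0.0, "avgWaitTimeMs": 0.0}

    async def start(self):
        self._running = True
        # Keep a reference so the background task is not garbage-collected.
        self._loopTask = asyncio.create_task(self._batchLoop())

    async def stop(self):
        self._running = False
        # Wait for the loop to notice the flag and exit.
        if self._loopTask is not None:
            await self._loopTask
            self._loopTask = None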

    async def submit(self, model: str, payload: Dict[str, Any]) -> Any:
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(requestId=str(uuid.uuid4()), payload=payload, future=future)
        self.pendingRequests[model].append(request)
        self._stats["totalRequests"] += 1
        if len(self.pendingRequests[model]) >= self.config.maxBatchSize:
            asyncio.create_task(self._processBatch(model))
        return await future

    async def _batchLoop(self):
        while self._running:
            for model in list(self.pendingRequests.keys()):
                if self.pendingRequests[model]:
                    oldest = self.pendingRequests[model][0]
                    waitTime = (time.time() - oldest.submittedAt) * 1000
                    if waitTime >= self.config.maxWaitTimeMs:
                        await self._processBatch(model)
            await asyncio.sleep(0.005)

    async def _processBatch(self, model: str):
        requests = self.pendingRequests[model][:self.config.maxBatchSize]
        self.pendingRequests[model] = self.pendingRequests[model][len(requests):]
        if not requests:
            return
        batchSize = len(requests)
        self._stats["totalBatches"] += 1
        waitTimes = [(time.time() - r.submittedAt) * 1000 for r in requests]
        avgWait = sum(waitTimes) / len(waitTimes)
        self._stats["avgWaitTimeMs"] = self._stats["avgWaitTimeMs"] * 0.95 + avgWait * 0.05
        self._stats["avgBatchSize"] = self._stats["avgBatchSize"] * 0.95 + batchSize * 0.05
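        try:
            batchPayload = self._mergePayloads([r.payload for r in requests])
            # inferenceFn is expected to return one result per request, in order.
            results = await self.inferenceFn(model, batchPayload)
            if len(results) != batchSize:
                raise ValueError(f"expected {batchSize} results, got {len(results)}")
            for request, result in zip(requests, results):
                if not request.future.done():
                    request.future.set_result(result)
        except Exception as e:
            # Fail every caller in the batch rather than leaving futures pending.
            for request in requests:
                if not request.future.done():
                    request.future.set_exception(e)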

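    def _mergePayloads(self, payloads: List[Dict]) -> Dict:
        # "messages" becomes a list of conversations and "batch_size" is added. This is not
        # the standard OpenAI chat schema, so inferenceFn must unpack it for the backend.
        messages = []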
        for payload in payloads:
            if "messages" in payload:
                messages.append(payload["messages"])
            elif "prompt" in payload:
                messages.append([{"role": "user", "content": payload["prompt"]}])
        return {"model": payloads[0].get("model", "default"), "messages": messages, "stream": False, "batch_size": len(payloads)}

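Pattern 7: Observability and distributed tracing

End-to-end tracing architecture

An AI inference path usually spans several components: API gateway → sidecar proxy → model server → GPU scheduler. OpenTelemetry provides end-to-end tracing across them, which helps locate performance bottlenecks.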

┌─────────────────────────────────────────────────────────────┐
│                    Observability Stack                        │
│                                                             │
│  Request ──▶ [Gateway] ──▶ [Sidecar] ──▶ [Model Server]   │
│      │           │            │              │              │
│      ▼           ▼            ▼              ▼              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              OpenTelemetry Collector                 │   │
│  │                                                     │   │
│  │  Traces ──▶ Jaeger/Tempo                            │   │
│  │  Metrics ──▶ Prometheus                             │   │
│  │  Logs   ──▶ Loki/Elasticsearch                      │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Key Metrics:                                               │
│  - inference_latency_ms (P50/P95/P99)                      │
│  - model_tokens_per_second                                 │
│  - gpu_utilization_percent                                 │
│  - batch_size_avg                                          │
│  - request_queue_depth                                     │
│  - model_load_time_seconds                                 │
└─────────────────────────────────────────────────────────────┘
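
The spans from Gateway, Sidecar and Model Server only join into one trace if every hop forwards the trace context. The following is a minimal sketch of how a sidecar could do that with the W3C `traceparent` header, using only the standard library. The function names are hypothetical, the header dict is assumed to use lowercase keys, and in practice the OpenTelemetry SDK would normally handle propagation.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: Optional[str]) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if absent or malformed."""
    if not header:
        return None
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build the trace header for the hop to the model server (hypothetical helper).

    Keeps the caller's trace ID so the spans join one trace and mints a new
    span ID for the sidecar's own span. Starts a fresh, sampled trace when the
    incoming header is missing or invalid.
    """
    parsed = parse_traceparent(incoming.get("traceparent"))
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

Reusing the incoming trace ID while replacing the span ID is what lets Jaeger or Tempo draw the sidecar span as a child of the gateway span.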

OpenTelemetry Sidecar Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-obs
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference-obs
  template:
    metadata:
      labels:
        app: ai-inference-obs
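      annotations:
        # Go auto-instrumentation via the OpenTelemetry Operator. The collector
        # sidecar is declared by hand below, so sidecar.opentelemetry.io/inject
        # is left out to avoid injecting a second collector on the same ports.
        instrumentation.opentelemetry.io/inject-go: "true"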
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "ai-inference-app"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              # The collector sidecar shares the Pod's network namespace.
              value: "http://localhost:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
        - name: otel-sidecar
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317
            - containerPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol-contrib/config.yaml
              subPath: config.yaml
      volumes:
        - name: otel-config
          configMap:
            name: otel-sidecar-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-sidecar-config
  namespace: ai-serving
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      transform:
        error_mode: ignore
        trace_statements:
          - context: span
            statements:
              - set(attributes["ai.model.name"], attributes["model"]) where attributes["model"] != nil
              - set(attributes["ai.inference.latency_ms"], attributes["duration"]/1000000) where attributes["duration"] != nil
    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
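
The Deployment above sets `OTEL_TRACES_SAMPLER=parentbased_traceidratio` with an argument of `0.1`. As a rough illustration of what that means, the sketch below follows the parent's decision when there is one and otherwise samples deterministically from the trace ID. The function name is hypothetical, and the threshold comparison on the low 64 bits is an assumption modeled on common SDK behavior; the exact algorithm differs between SDKs.

```python
from typing import Optional


def should_sample(trace_id_hex: str, parent_sampled: Optional[bool], ratio: float = 0.1) -> bool:
    """Parent-based, trace-ID-ratio sampling decision (illustrative only)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    # A span with a parent inherits the parent's decision, so a trace is
    # recorded on every hop or on none.
    if parent_sampled is not None:
        return parent_sampled
    # Root span: derive the decision from the trace ID so that every service
    # seeing the same ID reaches the same answer.
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < int(ratio * (1 << 64))
```

Because the decision depends only on the trace ID and the parent flag, the sidecar and the business container do not need to coordinate to keep traces complete.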

Custom Inference Metrics

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-metrics-rules
  namespace: ai-serving
data:
  rules.yaml: |
    groups:
      - name: ai_inference_metrics
        interval: 15s
        rules:
          - record: ai:inference:latency_p99
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="ai-inference"}[5m])) by (le, model))
          - record: ai:inference:tokens_per_second
            expr: sum(rate(ai_inference_tokens_total{job="ai-inference"}[5m])) by (model)
          - record: ai:inference:gpu_utilization
            expr: avg(DCGM_FI_DEV_GPU_UTIL) by (instance, gpu)
          - alert: InferenceHighLatency
            expr: ai:inference:latency_p99 > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "AI inference P99 latency exceeds 5 seconds"
          - alert: GPUUtilizationLow
            expr: ai:inference:gpu_utilization < 30
            for: 10m
            labels:
              severity: info
            annotations:
              summary: "GPU utilization below 30%"
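
The `ai:inference:latency_p99` rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw samples. The sketch below reimplements the core of that estimate in Python to show why bucket boundaries limit its precision. It is a simplified approximation of the Prometheus behavior for classic histograms with positive bounds, not the actual implementation, and the bucket values in the usage note are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have upper_bound == math.inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Target falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / in_bucket
```

For example, with buckets `[(0.5, 50), (1.0, 80), (2.5, 95), (5.0, 99), (math.inf, 100)]` the 0.9 quantile is interpolated inside the 1.0 to 2.5 second bucket. A 5-second alert threshold is therefore only meaningful if the histogram has bucket boundaries around 5 seconds.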

5 Common Pitfalls and Fixes

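Pitfall 1: Sidecar Start Order Causes Lost Requests

Symptom: The business container starts before the sidecar proxy. Its first inference requests go to localhost while the sidecar is not yet listening, so they fail.

Fix: Declare the sidecar first and give it a postStart hook that blocks until its health endpoint responds. The kubelet currently starts containers in the listed order and waits for a postStart hook to finish before starting the next container. In addition, point the business container's readinessProbe at the sidecar's health check so the Pod receives no traffic until the proxy is up. On clusters that support native sidecar containers (an init container with restartPolicy: Always), that mechanism provides the ordering without the hook.

spec:
  containers:
    # Listed first so that its postStart hook delays the business container.
    - name: sidecar-proxy
      lifecycle:
        postStart:
          exec:
            # Requires curl in the sidecar image; -f makes HTTP errors count as failures.
            command: ["/bin/sh", "-c", "until curl -sf http://localhost:15001/health; do sleep 1; done"]
    - name: business-app
      readinessProbe:
        httpGet:
          path: /health
          port: 15001
        initialDelaySeconds: 5
        periodSeconds: 5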
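
Where container ordering cannot be relied on, the business application can also guard its first inference call itself. The sketch below polls a health endpoint until it answers or a timeout expires. The function names are hypothetical, and the default URL is an assumption that simply reuses the `localhost:15001/health` endpoint from the YAML above.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and similar errors just mean "not up yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_health_probe(url: str = "http://localhost:15001/health") -> Callable[[], bool]:
    """Build a probe that treats any 2xx response as healthy."""

    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=2) as response:
            return 200 <= response.status < 300

    return probe
```

A typical call at application start-up would be `wait_until_ready(http_health_probe())`, failing fast if it returns `False`. The `sleep` and `clock` parameters exist only so the loop can be tested without real delays.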

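Pitfall 2: iptables Rules Conflict with GPU Driver Traffic

Symptom: The sidecar's iptables interception rules disrupt NVIDIA GPU driver communication, and model loading fails.

Fix: Exclude the ports and IP ranges used for GPU communication so that iptables does not intercept that traffic. Adjust the range and ports below to the ones your environment actually uses.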

metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.96.0.0/12"
    traffic.sidecar.istio.io/excludeOutboundPorts: "50051,50052"

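Pitfall 3: Large-Model Request Bodies Exceed the Envoy Buffer

Symptom: The prompt in an LLM inference request can be very long (tens of KB). When it exceeds the buffer size configured in Envoy, the request is truncated or rejected with a 413 error.

Fix: Raise Envoy's request buffer limit, or stream the request instead of buffering it. The DestinationRule below only upgrades upstream connections to HTTP/2 and caps requests per connection. It does not raise any buffer limit by itself, so a buffer change has to be made separately in the Envoy configuration.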

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-dest
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100

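Pitfall 4: Sidecar Resource Contention Causes Inference Latency Jitter

Symptom: The sidecar proxy shares a Pod with the model server, and contention for CPU and memory produces latency spikes during inference.

Fix: Give the sidecar its own resource requests and limits, and pin CPUs with the CPU Manager static policy. The static policy only grants exclusive cores to containers in Guaranteed-QoS Pods that request whole CPUs, so a sidecar with fractional requests like the one below stays in the shared pool.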

spec:
  containers:
    - name: sidecar-proxy
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
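  runtimeClassName: nvidia
  # overhead is normally populated from the RuntimeClass at admission time;
  # if set here, it must match the RuntimeClass's overhead.podFixed or the Pod is rejected.
  overhead:
    podFixed:
      cpu: "200m"
      memory: "256Mi"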

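Pitfall 5: Sidecar Connection Pool Exhausted During Model Hot Reload

Symptom: When the model version is switched, old connections are not closed and new connections fail to establish, which exhausts the connection pool.

Fix: Configure sensible connection pool timeouts and an idle-connection reclamation policy.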

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-conn-pool
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10s
        idleTimeout: 60s
      http:
        maxRequestsPerConnection: 50

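Troubleshooting 10 Common Errors

No.	Error message	Cause	Fix
1	`Sidecar proxy not ready`	iptables rules not injected, or the sidecar container has not started	Check that the namespace carries the label `istio-injection=enabled` and that the sidecar image was pulled successfully
2	`upstream connect error or disconnect/reset before headers`	Model server not ready, or port mismatch	Check the model container's health and confirm the port matches the VirtualService
3	`GPU out of memory`	Model loading exceeds GPU memory	Lower `gpu_memory_utilization`, use a quantized model, or add GPUs
4	`connection refused to 127.0.0.1:15001`	Sidecar is not listening on the expected port	Check the sidecar configuration and confirm the inbound port setting
5	`request body too large`	Request body exceeds Envoy's buffer limit	Increase `max_request_size` or enable streaming
6	`model not found in registry`	Model name does not match the routing configuration	Check the registered model name against the `model` field in the routing rules
7	`circuit breaker open`	Consecutive failures in the backend model server tripped the circuit breaker	Check the model server's health and tune the breaker thresholds
8	`timeout waiting for batch completion`	Batch wait timed out	Increase `maxWaitTimeMs` or reduce `maxBatchSize`
9	`CUDA error: no kernel image is available`	The CUDA kernels in the container were not built for this GPU's compute capability	Use an image or framework build that targets the GPU architecture, and check that the NVIDIA driver version is compatible with the container's CUDA version
10	`OOMKilled` for sidecar container	Sidecar ran out of memory and was killed by K8s	Raise the sidecar's memory limit and check for memory leaks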

Advanced Optimization Techniques

1. Adaptive Batching Window

Adjust the batching window dynamically based on real-time load:

class AdaptiveBatcher:
    def __init__(self, minWaitMs=10, maxWaitMs=100, targetBatchSize=16):
        self.minWaitMs = minWaitMs
        self.maxWaitMs = maxWaitMs
        self.targetBatchSize = targetBatchSize
        self.currentWaitMs = minWaitMs
        self._emaArrivalRate = 0.0

    def updateWaitTime(self, queueSize: int, intervalMs: float):
        if intervalMs > 0:
            arrivalRate = queueSize / (intervalMs / 1000.0)
            self._emaArrivalRate = 0.7 * self._emaArrivalRate + 0.3 * arrivalRate
        if self._emaArrivalRate > 0:
            optimalWait = (self.targetBatchSize / self._emaArrivalRate) * 1000
            self.currentWaitMs = max(self.minWaitMs, min(self.maxWaitMs, optimalWait))
        else:
            self.currentWaitMs = self.maxWaitMs
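
To get a feel for how the clamp behaves, the steady-state part of `updateWaitTime` can be evaluated on its own. The standalone function below restates that formula with the same defaults; its name is hypothetical and it leaves out the EMA smoothing.

```python
def optimal_wait_ms(arrival_rate: float, target_batch: int = 16,
                    min_wait_ms: float = 10, max_wait_ms: float = 100) -> float:
    """Time needed to collect `target_batch` requests, clamped to the window limits."""
    if arrival_rate <= 0:
        return max_wait_ms  # no traffic observed: wait as long as allowed
    fill_time_ms = target_batch / arrival_rate * 1000
    return max(min_wait_ms, min(max_wait_ms, fill_time_ms))


for rate in (50, 400, 5000):
    print(f"{rate:>5} req/s -> wait {optimal_wait_ms(rate):.1f} ms")
```

At 50 req/s a full batch would take 320 ms to fill, so the wait is capped at 100 ms and batches go out partially filled. At 400 req/s the window settles at 40 ms. At 5000 req/s the batch fills in about 3 ms and the 10 ms floor applies.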

2. Model Warm-up and Cold-Start Optimization

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-warmup-config
  namespace: ai-serving
data:
  warmup.yaml: |
    models:
      - name: qwen-chat-v2.5
        warmupRequests:
          - prompt: "Hello, how are you?"
            maxTokens: 32
          - prompt: "Explain quantum computing in one sentence."
            maxTokens: 64
        warmupInterval: 300s
        maxWarmupRetries: 3
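
The ConfigMap above only declares the warm-up requests; something in the sidecar still has to send them. Below is a minimal sketch of such a runner. The function name and the `send` callback are hypothetical, and treating `maxWarmupRetries` as the total number of attempts per request is an assumption, since the config does not define its semantics.

```python
from typing import Callable, Dict, List


def run_warmup(
    model: str,
    warmup_requests: List[Dict],
    send: Callable[[str, Dict], None],
    max_retries: int = 3,
) -> Dict[str, int]:
    """Send each warm-up request, trying each one up to `max_retries` times."""
    summary = {"succeeded": 0, "failed": 0}
    for request in warmup_requests:
        for attempt in range(1, max_retries + 1):
            try:
                send(model, request)  # e.g. an HTTP call to the model server
                summary["succeeded"] += 1
                break
            except Exception:
                # Give up on this request only after the last attempt.
                if attempt == max_retries:
                    summary["failed"] += 1
    return summary
```

A caller could run this once before marking the Pod ready and then again every `warmupInterval`, and keep the Pod out of rotation while `failed` is non-zero.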

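3. Inference Result Caching

Cache inference results for identical prompts to avoid recomputation. This suits embeddings and deterministic decoding; with sampling enabled, a cached reply pins one of many possible outputs.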

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-cache-config
  namespace: ai-serving
data:
  cache.yaml: |
    enabled: true
    backend: redis
    redis:
      endpoint: redis://redis-cluster:6379
      ttl: 3600
      maxMemory: 2gb
    keyStrategy: prompt_hash
    cacheableModels:
      - qwen-chat-v2.5
      - bge-embedding-v1.5
    hitRateThreshold: 0.3
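
The `keyStrategy: prompt_hash` setting implies that a cache key is derived from the request content. One way this could look is sketched below. The key format, the function name, and the in-memory `TTLCache` standing in for the Redis backend are all assumptions made for illustration.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def prompt_cache_key(model: str, payload: Dict[str, Any]) -> str:
    """Hash the model name together with every request field that affects the output."""
    canonical = json.dumps(
        {"model": model, "payload": payload},
        sort_keys=True,  # key order must not change the hash
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return "infer:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for the Redis backend (illustrative only)."""

    def __init__(self, ttl_s: float = 3600, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl_s, value)
```

Including the model name and the full payload (temperature, max tokens and so on) in the hash prevents a reply generated under one set of parameters from being served for another.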

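4. Request Priority and Preemption

PriorityClass applies to Pods, not to individual requests, so the two classes below are meant for separate real-time and batch inference workloads: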

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 1000000
globalDefault: false
description: "Real-time inference requests with a strict latency SLA"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 10000
globalDefault: false
description: "Batch inference with no latency SLA"
preemptionPolicy: Never

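Comparison: Sidecar Proxy vs Service Mesh vs Gateway

Dimension	Sidecar proxy	Service Mesh (Istio)	API Gateway
Deployment location	Inside the Pod, alongside the business container	Inside the Pod, mesh-wide coverage	Cluster edge, deployed independently
Traffic interception	iptables/eBPF	iptables/ztunnel	DNS/virtual IP
Model routing	Custom logic, flexible	VirtualService, declarative	Route rules, limited
Batching	Native support, deeply customizable	Not supported	Not supported
GPU awareness	Can be aware of GPU resources	None	None
Performance overhead	Low (5-15ms)	Medium (10-30ms)	Low (2-5ms)
Observability	Custom metrics	Mesh-wide metrics	Ingress metrics
Operational complexity	Medium	High	Low
Best suited for	Dedicated AI inference proxy	Cluster-wide service communication	External traffic entry point
Learning curve	Medium	High	Low

Recommended strategy: for AI inference, use a dedicated sidecar proxy for model routing and batching, the Service Mesh for service-to-service communication, and the API Gateway for external ingress. Each layer handles its own concern without interfering with the others.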

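Recommended Online Tools

YAML/JSON formatter: /zh-TW/json/format (formats K8s YAML configs)
Base64 encoder/decoder: /zh-TW/encode/base64 (handles credentials and keys stored in Secrets)
curl-to-code converter: /zh-TW/dev/curl-to-code (quickly generates API test code)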

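Summary

By 2026, the K8s sidecar AI inference proxy has become a standard architecture pattern for deploying AI inference. The 7 production patterns cover the full path from traffic interception to observability:

Envoy traffic interception requires no changes to business code.
Smart model routing selects a model dynamically from request characteristics.
A/B testing and canary releases make model rollouts safer.
Multi-model serving lets versions coexist.
GPU resource pooling raises utilization from 30% to 80%+.
Batching and request merging raise throughput 5-8x.
OpenTelemetry end-to-end tracing exposes performance bottlenecks.

The core principle: the sidecar proxy focuses on inference logic, the business container focuses on business logic, and the two communicate over localhost with no coupling and no intrusion into business code.

External references: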