K8s Sidecar AI Inference Proxy: 7 Production Patterns from Traffic Interception to Model Routing
Are your AI inference services still running without a proxy layer? Do you modify business code every time a model is upgraded, struggle to manage traffic when multiple model versions coexist, or add GPU capacity while utilization sits below 30%? In 2026, the Kubernetes Sidecar pattern is a common architecture for AI inference proxies: it decouples inference logic from business logic by moving traffic handling into a Sidecar container.
Key Takeaways
- Understand the architecture and use cases of K8s Sidecar AI inference proxies
- Master complete implementations of 7 production-grade Sidecar proxy patterns
- Learn YAML configurations for traffic interception, model routing, and A/B testing
- Discover advanced optimization techniques like GPU resource pooling and batch merging
- Avoid 5 common pitfalls and troubleshoot 10 frequent errors
Table of Contents
- Sidecar AI Proxy Architecture Overview
- Pattern 1: Envoy Traffic Interception and Rewriting
- Pattern 2: Smart Model Routing
- Pattern 3: A/B Testing and Canary Deployment
- Pattern 4: Multi-Model Serving and Version Management
- Pattern 5: GPU Resource Pooling
- Pattern 6: Batching and Request Merging
- Pattern 7: Observability and Distributed Tracing
- 5 Common Pitfalls and Solutions
- 10 Common Error Troubleshooting
- Advanced Optimization Techniques
- Comparison: Sidecar Proxy vs Service Mesh vs Gateway
- Recommended Online Tools
Sidecar AI Proxy Architecture Overview
Why AI Inference Needs a Sidecar Proxy
Traditional AI inference deployments couple model loading, inference execution, and traffic management entirely within the business container. When model versions iterate, routing policies change, or resource limits adjust, you must rebuild and redeploy the entire service. The Sidecar proxy pattern separates concerns:
┌─────────────────────────────────────────────────┐
│ Pod │
│ │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Business │ │ Sidecar Proxy │ │
│ │ Container │─────▶│ (AI Inference) │ │
│ │ │ │ │ │
│ │ - API logic │ │ - Traffic intercept │ │
│ │ - Business │ │ - Model routing │ │
│ │ - Result │ │ - Load balancing │ │
│ │ aggregation│ │ - Batch merging │ │
│ │ │ │ - Metrics collection│ │
│ │ :8080 │ │ │ │
│ └──────────────┘ │ :15001(outbound) │ │
│ │ │ :15006(inbound) │ │
│ ▼ └──────────────────────┘ │
│ ┌──────────────┐ │ │
│ │ Model │◀──────────────┘ │
│ │ Server │ │
│ │ (vLLM/Triton│ │
│ │ /Ollama) │ │
│ │ :8000 │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────┘
Sidecar Proxy Core Responsibilities
| Responsibility | Description | Benefit |
|---|---|---|
| Traffic interception | Intercept inference requests from business container | Zero business code changes |
| Model routing | Route to different models based on request characteristics | Multi-model version coexistence |
| Load balancing | Distribute inference requests across replicas | Higher throughput |
| Batching | Merge multiple requests for batch inference | 3-5x GPU utilization improvement |
| Circuit breaking | Fast degradation when model service fails | System stability protection |
| Observability | Collect inference latency, throughput metrics | Full-chain observability |
Pattern 1: Envoy Traffic Interception and Rewriting
Architecture
Envoy acts as the Sidecar proxy: iptables rules redirect outbound traffic from the business container to Envoy, which rewrites inference requests and forwards them to the target model services. This is the classic traffic interception approach in the K8s Sidecar pattern.
Client Request
│
▼
┌─────────┐ iptables ┌──────────────┐
│ Business │───redirect───▶│ Envoy │
│ Container│ :8080 │ Sidecar │
│ │ │ :15001 │
└─────────┘ └──────┬───────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Model A │ │ Model B │ │ Model C │
│ :8000 │ │ :8001 │ │ :8002 │
└─────────┘ └─────────┘ └─────────┘
Complete Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.0/8"
        traffic.sidecar.istio.io/excludeInboundPorts: "9090"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: INFERENCE_ENDPOINT
              value: "http://localhost:15001/v1/completions"
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
        - name: model-server
          image: vllm/vllm-openai:v0.6.0
          # The vllm-openai image is configured through CLI arguments,
          # not through MODEL_NAME-style environment variables.
          # A 72B model in 16-bit precision does not fit on a single 80 GB GPU;
          # expect to need more GPUs (--tensor-parallel-size) or quantization.
          args:
            - "--model"
            - "Qwen/Qwen2.5-72B-Instruct"
            - "--gpu-memory-utilization"
            - "0.9"
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
Envoy Rewrite Rules
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-rewrite
  namespace: ai-serving
spec:
  hosts:
    - inference-internal
  http:
    # Both conditions sit in one match entry, so they are ANDed:
    # only /v1/completions requests tagged as chat are rewritten.
    - match:
        - uri:
            prefix: "/v1/completions"
          headers:
            x-model-type:
              exact: "chat"
      rewrite:
        uri: "/v1/chat/completions"
      route:
        - destination:
            host: model-server
            port:
              number: 8000
    - match:
        - uri:
            prefix: "/v1/embeddings"
      route:
        - destination:
            host: embedding-server
            port:
              number: 8001
Pattern 2: Smart Model Routing
Dynamic Routing Based on Request Characteristics
In AI inference scenarios, different requests may need routing to models of different specifications. The Sidecar proxy can perform smart routing based on request headers, payload content, token count, and other features.
┌──────────────────────────────────────────────────────┐
│ Smart Model Router │
│ │
│ Request ──▶ [Token Counter] ──▶ [Model Selector] │
│ │ │ │
│ │ ┌──────────────┼──────────┐ │
│ │ ▼ ▼ ▼ │
│ │ ┌───────┐ ┌─────────┐ ┌──────┐ │
│ │ │Small │ │Medium │ │Large │ │
│ │ │Model │ │Model │ │Model │ │
│ │ │<1K tok│ │1K-8K tok│ │>8K │ │
│ │ │Qwen2.5│ │Qwen2.5 │ │Qwen2.│ │
│ │ │-7B │ │-32B │ │5-72B │ │
│ │ └───────┘ └─────────┘ └──────┘ │
└──────────────────────────────────────────────────────┘
Routing Configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: smart-model-route
  namespace: ai-serving
spec:
  hosts:
    - ai-router
  http:
    - match:
        - headers:
            x-token-range:
              exact: "small"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-7b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "medium"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
          weight: 100
    - match:
        - headers:
            x-token-range:
              exact: "large"
          uri:
            prefix: "/v1"
      route:
        - destination:
            host: qwen-72b-service
            port:
              number: 8000
          weight: 100
    # Fallback when no x-token-range header is present.
    - route:
        - destination:
            host: qwen-32b-service
            port:
              number: 8000
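The VirtualService above only matches on an `x-token-range` header, so some component upstream has to set it. A minimal sketch of that step follows, assuming a rough 4-characters-per-token estimate and the 1K/8K boundaries from the diagram. `classify_token_range` and `add_token_range_header` are hypothetical helper names for illustration, not part of Istio or Envoy.

```python
from typing import Dict

SMALL_MAX_TOKENS = 1000   # at or below this estimate: "small"
MEDIUM_MAX_TOKENS = 8000  # at or below this estimate: "medium", else "large"


def classify_token_range(text: str) -> str:
    """Map a prompt to small/medium/large using a rough token estimate."""
    estimated_tokens = len(text) // 4  # heuristic only, not a real tokenizer
    if estimated_tokens <= SMALL_MAX_TOKENS:
        return "small"
    if estimated_tokens <= MEDIUM_MAX_TOKENS:
        return "medium"
    return "large"


def add_token_range_header(headers: Dict[str, str], prompt: str) -> Dict[str, str]:
    """Return a copy of the headers with x-token-range set for routing."""
    routed = dict(headers)
    routed["x-token-range"] = classify_token_range(prompt)
    return routed
```

A real deployment would likely count tokens with the model's own tokenizer; the character heuristic merely keeps the sketch dependency-free.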
Go Router Proxy Implementation
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.uber.org/zap"
)

// ModelRouteConfig maps token-count ranges to model endpoints.
type ModelRouteConfig struct {
	SmallModelEndpoint  string `json:"smallModelEndpoint"`
	MediumModelEndpoint string `json:"mediumModelEndpoint"`
	LargeModelEndpoint  string `json:"largeModelEndpoint"`
	SmallTokenThreshold int    `json:"smallTokenThreshold"`
	LargeTokenThreshold int    `json:"largeTokenThreshold"`
}

// InferenceRequest covers both completion (Prompt) and chat (Messages) payloads.
type InferenceRequest struct {
	Model     string    `json:"model"`
	Messages  []Message `json:"messages,omitempty"`
	Prompt    string    `json:"prompt,omitempty"`
	MaxTokens int       `json:"max_tokens,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type SmartRouter struct {
	config  *ModelRouteConfig
	logger  *zap.Logger
	metrics *RouterMetrics
}

type RouterMetrics struct {
	routeDecision  *prometheus.CounterVec
	requestLatency *prometheus.HistogramVec
}

func NewSmartRouter(cfg *ModelRouteConfig, logger *zap.Logger) *SmartRouter {
	metrics := &RouterMetrics{
		routeDecision: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "ai_router_route_decision_total",
				Help: "Model route decision counter",
			},
			[]string{"model_size", "model_name"},
		),
		requestLatency: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "ai_router_request_latency_seconds",
				Help:    "Request routing latency",
				Buckets: prometheus.DefBuckets,
			},
			[]string{"model_size"},
		),
	}
	prometheus.MustRegister(metrics.routeDecision, metrics.requestLatency)
	return &SmartRouter{
		config:  cfg,
		logger:  logger,
		metrics: metrics,
	}
}

// estimateTokenCount uses a rough 4-bytes-per-token heuristic,
// not a real tokenizer.
func (r *SmartRouter) estimateTokenCount(req *InferenceRequest) int {
	totalChars := len(req.Prompt)
	for _, msg := range req.Messages {
		totalChars += len(msg.Content)
	}
	return totalChars / 4
}

// selectModel returns the size label and endpoint for a token count.
func (r *SmartRouter) selectModel(tokenCount int) (string, string) {
	if tokenCount <= r.config.SmallTokenThreshold {
		return "small", r.config.SmallModelEndpoint
	}
	if tokenCount <= r.config.LargeTokenThreshold {
		return "medium", r.config.MediumModelEndpoint
	}
	return "large", r.config.LargeModelEndpoint
}

func (r *SmartRouter) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	defer req.Body.Close()
	body, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusBadRequest)
		return
	}
	var inferReq InferenceRequest
	if err := json.Unmarshal(body, &inferReq); err != nil {
		http.Error(w, "invalid request format", http.StatusBadRequest)
		return
	}
	tokenCount := r.estimateTokenCount(&inferReq)
	modelSize, endpoint := r.selectModel(tokenCount)
	r.logger.Info("routing decision",
		zap.Int("token_count", tokenCount),
		zap.String("model_size", modelSize),
		zap.String("endpoint", endpoint),
	)
	// model_name comes from the request body; validate it against an
	// allow-list in production to keep label cardinality bounded.
	r.metrics.routeDecision.WithLabelValues(modelSize, inferReq.Model).Inc()
	target, err := url.Parse(endpoint)
	if err != nil {
		http.Error(w, "invalid model endpoint", http.StatusInternalServerError)
		return
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	// Restore the body consumed above so the proxy can forward it.
	req.Body = io.NopCloser(bytes.NewReader(body))
	req.ContentLength = int64(len(body))
	proxy.ServeHTTP(w, req)
	// Observed after proxying, so this includes upstream inference time.
	r.metrics.requestLatency.WithLabelValues(modelSize).Observe(time.Since(start).Seconds())
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	cfg := &ModelRouteConfig{
		SmallModelEndpoint:  "http://qwen-7b-service:8000",
		MediumModelEndpoint: "http://qwen-32b-service:8000",
		LargeModelEndpoint:  "http://qwen-72b-service:8000",
		SmallTokenThreshold: 1000,
		LargeTokenThreshold: 8000,
	}
	router := NewSmartRouter(cfg, logger)
	mux := http.NewServeMux()
	mux.Handle("/v1/", router)
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	})
	server := &http.Server{
		Addr:         ":15001",
		Handler:      mux,
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 120 * time.Second,
	}
	logger.Info("smart router starting", zap.String("addr", server.Addr))
	if err := server.ListenAndServe(); err != nil {
		logger.Fatal("server failed", zap.Error(err))
	}
}
Pattern 3: A/B Testing and Canary Deployment
Weight-Based Canary Deployment
When deploying AI models, canary deployment is essential. The Sidecar proxy implements weight-based traffic distribution, gradually shifting traffic from the old model to the new one.
┌────────────────────────────────────────────┐
│ Canary Deployment Flow │
│ │
│ Traffic ──▶ [Sidecar Proxy] ──┬── 90% ──▶│──▶ Model v1 (Stable)
│ │ │
│ └── 10% ──▶│──▶ Model v2 (Canary)
│ │
│ Metrics: │
│ ┌──────────────────────────────────────┐ │
│ │ v1: latency_p99=120ms error=0.1% │ │
│ │ v2: latency_p99=95ms error=0.05% │ │
│ └──────────────────────────────────────┘ │
│ │
│ Decision: Promote v2 ──▶ Shift to 50/50 │
└────────────────────────────────────────────┘
Canary Configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-canary
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    # Named so that tools such as Argo Rollouts can reference this route.
    - name: primary
      route:
        - destination:
            host: model-v1
            port:
              number: 8000
          weight: 90
        - destination:
            host: model-v2
            port:
              number: 8000
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 30s
        retryOn: 5xx,reset
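To make the 90/10 split concrete, here is a small sketch of weight-based backend selection. It is a hash-based variant: the same request ID always lands on the same backend, which is a deliberate simplification and not how Envoy itself picks a weighted cluster. `pick_backend` and the request-ID scheme are assumptions for illustration.

```python
import hashlib
from typing import Dict


def pick_backend(request_id: str, weights: Dict[str, int]) -> str:
    """Pick a backend in proportion to its weight, deterministically per ID."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the ID to a stable point in [0, total).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for backend, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return backend
    raise AssertionError("unreachable: point is always below total")
```

Calling `pick_backend("req-42", {"model-v1": 90, "model-v2": 10})` sends roughly one request in ten to the canary while keeping retries of the same request on the same model version.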
Header-Based A/B Testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-ab-test
  namespace: ai-serving
spec:
  hosts:
    - model-service
  http:
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-creative"
      route:
        - destination:
            host: model-v2-creative
            port:
              number: 8000
    - match:
        - headers:
            x-experiment:
              exact: "model-v2-precise"
      route:
        - destination:
            host: model-v2-precise
            port:
              number: 8000
    # Default: requests without an experiment header stay on v1.
    - route:
        - destination:
            host: model-v1
            port:
              number: 8000
Argo Rollouts Automated Canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-rollout
  namespace: ai-serving
spec:
  replicas: 4
  strategy:
    canary:
      canaryService: model-v2
      stableService: model-v1
      trafficRouting:
        istio:
          virtualServices:
            - name: model-canary
              routes:
                - primary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 75
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: model-quality-check
        startingStep: 2
        args:
          - name: canary-service
            value: model-v2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
  namespace: ai-serving
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          # Assumed address; point this at your own Prometheus service.
          address: http://prometheus:9090
          # code=~"5.." matches any 5xx status; "5xx" would match only the literal string.
          query: |
            sum(rate(http_requests_total{service="{{args.canary-service}}",code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.canary-service}}"}[1m]))
    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: result[0] <= 500
      provider:
        prometheus:
          # Assumed address; point this at your own Prometheus service.
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.canary-service}}"}[1m]))
              by (le)
            ) * 1000
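The promotion logic encoded in the Rollout steps and the AnalysisTemplate can be sketched in a few lines. This is an illustration of the decision rule only, not Argo Rollouts code: the thresholds (1% error rate, 500 ms P99) and the weight steps come from the configuration above, while `analysis_passes`, `next_weight`, and the choice to treat a missing (NaN) measurement as a failure are assumptions.

```python
import math
from typing import Sequence

STEPS = (10, 25, 50, 75, 100)  # canary weights from the Rollout steps


def analysis_passes(error_rate: float, p99_ms: float,
                    max_error_rate: float = 0.01,
                    max_p99_ms: float = 500.0) -> bool:
    """Return True when both canary metrics are within their thresholds."""
    # Assumption: no data (NaN, e.g. zero traffic) counts as a failure.
    if math.isnan(error_rate) or math.isnan(p99_ms):
        return False
    return error_rate <= max_error_rate and p99_ms <= max_p99_ms


def next_weight(current: int, passed: bool, steps: Sequence[int] = STEPS) -> int:
    """Advance to the next weight step, or fall back to 0 on failure."""
    if not passed:
        return 0  # abort: all traffic returns to the stable version
    for weight in steps:
        if weight > current:
            return weight
    return steps[-1]  # already fully promoted
```

With the metrics from the earlier diagram (0.05% errors, 95 ms P99), `analysis_passes(0.0005, 95.0)` holds and `next_weight(10, True)` moves the canary to 25%.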
Pattern 4: Multi-Model Serving and Version Management
Multi-Version Model Coexistence Architecture
In production, different business lines may depend on different model versions. The Sidecar proxy can manage multiple model versions simultaneously, enabling version coexistence and smooth migration.
┌─────────────────────────────────────────────────────┐
│ Multi-Model Serving Pod │
│ │
│ ┌──────────┐ ┌────────────────────────────┐ │
│ │ Business │ │ Model Router Sidecar │ │
│ │ App │────▶│ │ │
│ │ │ │ /v1/chat ──▶ v2.5 model │ │
│ │ │ │ /v1/embed ──▶ embed model │ │
│ │ │ │ /v1/rerank─▶ rerank model │ │
│ └──────────┘ └────────────┬───────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌────────┐│
│ │ vLLM │ │ TEI │ │ TEI ││
│ │ Chat │ │ Embed │ │ Rerank ││
│ │ :8000 │ │ :8080 │ │ :8081 ││
│ └──────────┘ └──────────┘ └────────┘│
└─────────────────────────────────────────────────────┘
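The path-to-model mapping in the diagram amounts to a longest-prefix lookup. A minimal sketch, assuming the three prefixes and ports shown above and model servers reachable on localhost inside the Pod; `ROUTES` and `resolve_backend` are hypothetical names.

```python
from typing import Dict, Optional

# Prefix -> backend, taken from the diagram (chat :8000, embed :8080, rerank :8081).
ROUTES: Dict[str, str] = {
    "/v1/chat": "http://localhost:8000",
    "/v1/embed": "http://localhost:8080",
    "/v1/rerank": "http://localhost:8081",
}


def resolve_backend(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the backend for the longest matching prefix, or None."""
    best: Optional[str] = None
    for prefix in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else None
```

Preferring the longest prefix keeps the lookup unambiguous if a more specific route, for example a version-pinned chat path, is added later.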
Multi-Model Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-serving
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-model
  template:
    metadata:
      labels:
        app: multi-model
    spec:
      containers:
        - name: chat-model
          image: vllm/vllm-openai:v0.6.0
          # The vllm-openai image is configured through CLI arguments,
          # not through MODEL_NAME/PORT environment variables.
          args:
            - "--model"
            - "Qwen/Qwen2.5-32B-Instruct"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
        - name: embedding-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-large-zh-v1.5"
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
        - name: rerank-model
          image: ghcr.io/huggingface/text-embeddings-inference:latest
          ports:
            - containerPort: 8081
          env:
            - name: MODEL_ID
              value: "BAAI/bge-reranker-v2-m3"
            - name: PORT
              value: "8081"
            - name: RERANK
              value: "true"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Model Version Management CRD
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-5
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.5"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.5-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 80
    canary: false
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
---
apiVersion: ai.toolsku.dev/v1alpha1
kind: ModelVersion
metadata:
  name: qwen-chat-v2-6-canary
  namespace: ai-serving
spec:
  modelName: qwen-chat
  version: "2.6"
  framework: vllm
  source:
    huggingFace:
      modelId: Qwen/Qwen2.6-32B-Instruct
      revision: main
  serving:
    port: 8000
    maxBatchSize: 32
    gpuMemoryUtilization: 0.9
  routing:
    weight: 20
    canary: true
  healthCheck:
    endpoint: /health
    interval: 10s
    timeout: 5s
    unhealthyThreshold: 3
Pattern 5: GPU Resource Pooling
GPU Sharing and Time-Slicing
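GPUs are the most expensive resource in AI inference. The Sidecar proxy enables GPU resource pooling: a scheduling layer in front of the model servers lets multiple inference services share the same GPU through time-slicing, which improves utilization. The sharing itself is enforced below the proxy (for example by the NVIDIA device plugin's time-slicing support, MPS, or MIG); the proxy decides which requests are sent to which share.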
┌─────────────────────────────────────────────────────┐
│ GPU Resource Pooling │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ Sidecar │ │ Sidecar │ │ Sidecar │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ GPU Scheduler Sidecar │ │
│ │ │ │
│ │ Time-Slicing: │ │
│ │ GPU 0: [A][A][B][B][C][A][A][B][B][C] │ │
│ │ GPU 1: [C][C][A][B][C][C][A][B][C][C] │ │
│ │ │ │
│ │ Memory Partitioning: │ │
│ │ GPU 0: 40% A | 35% B | 25% C │ │
│ │ GPU 1: 30% A | 40% B | 30% C │ │
│ └──────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │
│ │ A100 80G │ │ A100 80G │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────┘
GPU Time-Slicing Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: ai-serving
data:
  scheduler.yaml: |
    scheduling:
      strategy: time-slicing
      gpuGroups:
        - name: inference-pool
          gpuIds: [0, 1, 2, 3]
          timeSliceInterval: 100ms
          maxSharesPerGpu: 4
          memoryLimitPerShare: 20Gi
        - name: embedding-pool
          gpuIds: [4, 5]
          timeSliceInterval: 50ms
          maxSharesPerGpu: 8
          memoryLimitPerShare: 10Gi
      policies:
        - modelType: chat
          gpuGroup: inference-pool
          minShares: 1
          maxShares: 2
          priority: high
        - modelType: embedding
          gpuGroup: embedding-pool
          minShares: 1
          maxShares: 4
          priority: medium
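A quick sanity check for a pool definition like the one above is that the shares cannot promise more memory than the GPU has. The sketch below does that arithmetic, assuming the A100 80G cards from the diagram expose a full 80Gi to workloads (in practice the CUDA context and framework overhead reduce this). `parse_mib` and `pool_fits` are hypothetical helpers, and only the `Mi` and `Gi` suffixes are handled.

```python
import re

_UNITS_IN_MIB = {"Mi": 1, "Gi": 1024}


def parse_mib(quantity: str) -> int:
    """Convert a quantity such as '20Gi' or '512Mi' to MiB."""
    match = re.fullmatch(r"(\d+)(Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS_IN_MIB[match.group(2)]


def pool_fits(max_shares_per_gpu: int, memory_limit_per_share: str,
              gpu_memory: str) -> bool:
    """True when all shares together fit into one GPU's memory."""
    needed = max_shares_per_gpu * parse_mib(memory_limit_per_share)
    return needed <= parse_mib(gpu_memory)
```

Both pools above pass this check exactly at the limit (4 x 20Gi and 8 x 10Gi on an 80Gi card), which leaves no headroom; lowering `memoryLimitPerShare` slightly would be the safer configuration.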
GPU Resource Quota Management
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    # Extended resources cannot be overcommitted, so Kubernetes accepts
    # only the "requests." prefix for them in a ResourceQuota.
    requests.nvidia.com/gpu: "8"
    requests.nvidia.com/gpu-share: "32"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100000
globalDefault: false
description: "Low priority for batch inference jobs"
Python GPU Scheduler
import asyncio
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


class TaskPriority(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class InferenceTask:
    taskId: str
    modelId: str
    gpuMemoryRequired: int  # in MB
    priority: TaskPriority
    maxLatencyMs: int
    submittedAt: float = field(default_factory=time.time)
    assignedGpu: Optional[int] = None


@dataclass
class GpuSlot:
    gpuId: int
    totalMemory: int  # in MB
    usedMemory: int = 0
    currentModel: Optional[str] = None
    lastUsed: float = field(default_factory=time.time)

    @property
    def availableMemory(self) -> int:
        return self.totalMemory - self.usedMemory

    @property
    def utilization(self) -> float:
        return self.usedMemory / self.totalMemory if self.totalMemory > 0 else 0.0


class GpuScheduler:
    def __init__(self, gpuSlots: List[GpuSlot], maxQueueSize: int = 1000):
        self.gpuSlots = {slot.gpuId: slot for slot in gpuSlots}
        self.taskQueue: List[InferenceTask] = []
        self.maxQueueSize = maxQueueSize
        self._lock = asyncio.Lock()
        self._stats = {
            "totalScheduled": 0,
            "totalRejected": 0,
            "totalEvicted": 0,
        }

    async def submitTask(self, task: InferenceTask) -> Optional[int]:
        """Place the task on a GPU and return its ID, or queue it and return None."""
        async with self._lock:
            if len(self.taskQueue) >= self.maxQueueSize:
                self._stats["totalRejected"] += 1
                logger.warning(f"Task {task.taskId} rejected: queue full")
                return None
            gpuId = self._findBestGpu(task)
            if gpuId is not None:
                self._assign(task, gpuId)
                slot = self.gpuSlots[gpuId]
                logger.info(
                    f"Task {task.taskId} scheduled on GPU {gpuId}, "
                    f"memory: {slot.usedMemory}/{slot.totalMemory}"
                )
                return gpuId
            self.taskQueue.append(task)
            # Stable sort: FIFO order is preserved within a priority level.
            self.taskQueue.sort(key=lambda t: t.priority.value)
            logger.info(f"Task {task.taskId} queued, queue size: {len(self.taskQueue)}")
            return None

    def _assign(self, task: InferenceTask, gpuId: int) -> None:
        slot = self.gpuSlots[gpuId]
        slot.usedMemory += task.gpuMemoryRequired
        slot.currentModel = task.modelId
        slot.lastUsed = time.time()
        task.assignedGpu = gpuId
        self._stats["totalScheduled"] += 1

    def _findBestGpu(self, task: InferenceTask) -> Optional[int]:
        # Prefer the least-utilized GPU that has enough free memory.
        candidates = [
            (gpuId, slot)
            for gpuId, slot in self.gpuSlots.items()
            if slot.availableMemory >= task.gpuMemoryRequired
        ]
        if not candidates:
            return self._tryEvict(task)
        candidates.sort(key=lambda x: x[1].utilization)
        return candidates[0][0]

    def _tryEvict(self, task: InferenceTask) -> Optional[int]:
        # Only high-priority tasks may evict models idle for more than 5 minutes.
        if task.priority != TaskPriority.HIGH:
            return None
        for gpuId, slot in self.gpuSlots.items():
            idle = slot.currentModel and slot.lastUsed < time.time() - 300
            # Skip GPUs that would be too small even when empty.
            if idle and slot.totalMemory >= task.gpuMemoryRequired:
                logger.info(
                    f"Evicting model {slot.currentModel} from GPU {gpuId} "
                    f"for high-priority task {task.taskId}"
                )
                slot.usedMemory = 0
                slot.currentModel = None
                self._stats["totalEvicted"] += 1
                return gpuId
        return None

    async def releaseGpu(self, gpuId: int, memoryFreed: int):
        async with self._lock:
            slot = self.gpuSlots.get(gpuId)
            if slot is None:
                return
            slot.usedMemory = max(0, slot.usedMemory - memoryFreed)
            if slot.usedMemory == 0:
                slot.currentModel = None
            logger.info(f"GPU {gpuId} released {memoryFreed}MB, available: {slot.availableMemory}MB")
            # Admit queued tasks (highest priority first) while the head fits.
            while self.taskQueue and slot.availableMemory >= self.taskQueue[0].gpuMemoryRequired:
                self._assign(self.taskQueue.pop(0), gpuId)

    def getStats(self) -> Dict:
        return {
            **self._stats,
            "queueSize": len(self.taskQueue),
            "gpuUtilization": {
                gpuId: {
                    "utilization": f"{slot.utilization:.1%}",
                    "availableMemory": f"{slot.availableMemory}MB",
                    "currentModel": slot.currentModel,
                }
                for gpuId, slot in self.gpuSlots.items()
            },
        }
Pattern 6: Batching and Request Merging
Dynamic Batching Architecture
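GPU utilization for LLM inference is often low (10-30%) when each request is processed individually. The Sidecar proxy can collect multiple requests within a short time window and merge them into a single batched inference call, which can substantially improve throughput. Model servers such as vLLM already perform continuous batching internally, so proxy-side merging pays off mainly for backends that lack it or that accept a list of inputs per call, such as embedding endpoints.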
┌─────────────────────────────────────────────────────┐
│ Dynamic Batching Sidecar │
│ │
│ Request 1 ──▶ ┐ │
│ Request 2 ──▶ │ ┌──────────────────────────┐ │
│ Request 3 ──▶ ├─▶│ Batch Window (50ms) │ │
│ Request 4 ──▶ │ │ │ │
│ Request 5 ──▶ ┘ │ Collect → Merge → Send │ │
│ │ │ │
│ │ Batch Size: 4-32 │ │
│ │ Max Wait: 50ms │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Model Server │ │
│ │ (vLLM with │ │
│ │ continuous │ │
│ │ batching) │ │
│ └──────────────────────┘ │
│ │
│ Throughput: 1x → 5-8x │
│ Latency overhead: +5-15ms │
└─────────────────────────────────────────────────────┘
Batch Proxy Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: batch-proxy-config
  namespace: ai-serving
data:
  proxy.yaml: |
    server:
      port: 15001
      readTimeout: 30s
      writeTimeout: 120s
    batching:
      enabled: true
      maxBatchSize: 32
      maxWaitTimeMs: 50
      maxRequestTokens: 8192
      strategy: dynamic
    routing:
      defaultEndpoint: http://localhost:8000
      endpoints:
        - path: /v1/chat/completions
          model: chat
          batchEnabled: true
        - path: /v1/embeddings
          model: embedding
          batchEnabled: true
          maxBatchSize: 64
    circuitBreaker:
      enabled: true
      failureThreshold: 5
      recoveryTimeout: 30s
      halfOpenRequests: 3
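The `circuitBreaker` settings above describe a standard closed / open / half-open state machine. The following is a minimal sketch of that behavior under stated assumptions: the parameter names mirror the config keys, the `CircuitBreaker` class itself is hypothetical, and "half-open" is taken to mean that `halfOpenRequests` probe calls are let through and all of them must succeed before the breaker closes again.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 half_open_requests: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._probes_left = 0
        self._probe_successes = 0

    def allow_request(self) -> bool:
        """Return True if the caller may send a request to the model server."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                return False  # still cooling down: fail fast
            self.state = "half_open"
            self._probes_left = self.half_open_requests
            self._probe_successes = 0
        if self.state == "half_open":
            if self._probes_left == 0:
                return False  # probe budget used up, wait for results
            self._probes_left -= 1
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._probe_successes += 1
            if self._probe_successes >= self.half_open_requests:
                self.state = "closed"
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._trip()  # any failed probe reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

The proxy would call `allow_request()` before forwarding and then `record_success()` or `record_failure()` once the model server responds; the injectable `clock` exists only to make the timeout testable.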
Python Batch Proxy
import asyncio
import logging
import time
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set

logger = logging.getLogger(__name__)


@dataclass
class BatchRequest:
    requestId: str
    payload: Dict[str, Any]
    future: asyncio.Future
    # Monotonic clock: wait times must not jump when the wall clock changes.
    submittedAt: float = field(default_factory=time.monotonic)


@dataclass
class BatchConfig:
    maxBatchSize: int = 32
    maxWaitTimeMs: int = 50
    maxRequestTokens: int = 8192


class DynamicBatcher:
    def __init__(self, config: BatchConfig, inferenceFn):
        # inferenceFn(model, mergedPayload) must return one result per
        # merged request, in the same order.
        self.config = config
        self.inferenceFn = inferenceFn
        self.pendingRequests: Dict[str, List[BatchRequest]] = defaultdict(list)
        self._running = False
        self._loopTask: Optional[asyncio.Task] = None
        self._tasks: Set[asyncio.Task] = set()
        self._stats = {
            "totalBatches": 0,
            "totalRequests": 0,
            "avgBatchSize": 0.0,
            "avgWaitTimeMs": 0.0,
        }

    async def start(self):
        self._running = True
        # Keep a reference so the loop task is not garbage-collected.
        self._loopTask = asyncio.create_task(self._batchLoop())

    async def stop(self):
        self._running = False
        if self._loopTask is not None:
            await self._loopTask

    async def submit(self, model: str, payload: Dict[str, Any]) -> Any:
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(
            requestId=str(uuid.uuid4()),
            payload=payload,
            future=future,
        )
        self.pendingRequests[model].append(request)
        self._stats["totalRequests"] += 1
        # Flush immediately once a full batch has accumulated.
        if len(self.pendingRequests[model]) >= self.config.maxBatchSize:
            task = asyncio.create_task(self._processBatch(model))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return await future

    async def _batchLoop(self):
        # Flush partial batches whose oldest request has waited long enough.
        while self._running:
            for model in list(self.pendingRequests.keys()):
                if self.pendingRequests[model]:
                    oldest = self.pendingRequests[model][0]
                    waitTime = (time.monotonic() - oldest.submittedAt) * 1000
                    if waitTime >= self.config.maxWaitTimeMs:
                        await self._processBatch(model)
            await asyncio.sleep(0.005)

    async def _processBatch(self, model: str):
        requests = self.pendingRequests[model][:self.config.maxBatchSize]
        self.pendingRequests[model] = self.pendingRequests[model][len(requests):]
        if not requests:
            return
        batchSize = len(requests)
        self._stats["totalBatches"] += 1
        waitTimes = [(time.monotonic() - r.submittedAt) * 1000 for r in requests]
        avgWait = sum(waitTimes) / len(waitTimes)
        # Exponential moving averages with a 5% weight for the newest batch.
        self._stats["avgWaitTimeMs"] = (
            self._stats["avgWaitTimeMs"] * 0.95 + avgWait * 0.05
        )
        self._stats["avgBatchSize"] = (
            self._stats["avgBatchSize"] * 0.95 + batchSize * 0.05
        )
        logger.info(
            f"Processing batch for model={model}, "
            f"size={batchSize}, avgWait={avgWait:.1f}ms"
        )
        try:
            batchPayload = self._mergePayloads([r.payload for r in requests])
            results = await self.inferenceFn(model, batchPayload)
            if len(results) != batchSize:
                raise ValueError(
                    f"expected {batchSize} results, got {len(results)}"
                )
            for i, request in enumerate(requests):
                if not request.future.done():
                    request.future.set_result(results[i])
        except Exception as e:
            logger.error(f"Batch inference failed: {e}")
            for request in requests:
                if not request.future.done():
                    request.future.set_exception(e)

    def _mergePayloads(self, payloads: List[Dict]) -> Dict:
        # Proxy-internal format (one message list per request); this is not
        # the OpenAI request schema, so inferenceFn has to translate it.
        messages = []
        for payload in payloads:
            if "messages" in payload:
                messages.append(payload["messages"])
            elif "prompt" in payload:
                messages.append([{"role": "user", "content": payload["prompt"]}])
        return {
            "model": payloads[0].get("model", "default"),
            "messages": messages,
            "stream": False,
            "batch_size": len(payloads),
        }
Pattern 7: Observability and Distributed Tracing
Full-Chain Tracing Architecture
AI inference chains typically involve multiple components: API Gateway → Sidecar Proxy → Model Server → GPU Scheduler. OpenTelemetry enables full-chain tracing to help identify performance bottlenecks.
┌─────────────────────────────────────────────────────────────┐
│ Observability Stack │
│ │
│ Request ──▶ [Gateway] ──▶ [Sidecar] ──▶ [Model Server] │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry Collector │ │
│ │ │ │
│ │ Traces ──▶ Jaeger/Tempo │ │
│ │ Metrics ──▶ Prometheus │ │
│ │ Logs ──▶ Loki/Elasticsearch │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ - inference_latency_ms (P50/P95/P99) │
│ - model_tokens_per_second │
│ - gpu_utilization_percent │
│ - batch_size_avg │
│ - request_queue_depth │
│ - model_load_time_seconds │
└─────────────────────────────────────────────────────────────┘
OpenTelemetry Sidecar Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-obs
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference-obs
  template:
    metadata:
      labels:
        app: ai-inference-obs
      annotations:
        sidecar.opentelemetry.io/inject: "true"
        instrumentation.opentelemetry.io/inject-go: "true"
    spec:
      containers:
        - name: business-app
          image: myregistry/ai-business-app:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "ai-inference-app"
            # The collector runs as a sidecar in the same Pod, so the app
            # exports to localhost rather than to a cluster-wide service.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://localhost:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
        - name: otel-sidecar
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317
            - containerPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol-contrib/config.yaml
              subPath: config.yaml
      volumes:
        - name: otel-config
          configMap:
            name: otel-sidecar-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-sidecar-config
  namespace: ai-serving
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      # The filter processor DROPS spans that match a condition, so this
      # discards successful (HTTP 200) spans and keeps everything else.
      filter:
        error_mode: ignore
        traces:
          span:
            - 'attributes["http.status_code"] == 200'
      transform:
        error_mode: ignore
        trace_statements:
          - context: span
            statements:
              - set(attributes["ai.model.name"], attributes["model"]) where attributes["model"] != nil
              - set(attributes["ai.inference.latency_ms"], attributes["duration"]/1000000) where attributes["duration"] != nil
    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, filter, transform, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, transform, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Custom Inference Metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-metrics-rules
  namespace: ai-serving
data:
  rules.yaml: |
    groups:
      - name: ai_inference_metrics
        interval: 15s
        rules:
          - record: ai:inference:latency_p99
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="ai-inference"}[5m])) by (le, model))
          - record: ai:inference:tokens_per_second
            expr: sum(rate(ai_inference_tokens_total{job="ai-inference"}[5m])) by (model)
          # No gpu matcher: gpu="*" would match only a label literally equal to "*".
          - record: ai:inference:gpu_utilization
            expr: avg(DCGM_FI_DEV_GPU_UTIL) by (instance, gpu)
          - record: ai:inference:batch_size_avg
            expr: avg(ai_inference_batch_size{job="ai-inference"}) by (model)
          - record: ai:inference:queue_depth
            expr: ai_inference_request_queue_depth{job="ai-inference"}
          - alert: InferenceHighLatency
            expr: ai:inference:latency_p99 > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "AI inference P99 latency above 5s"
              description: "Model {{ $labels.model }} P99 latency is {{ $value }}s"
          - alert: GPUUtilizationLow
            expr: ai:inference:gpu_utilization < 30
            for: 10m
            labels:
              severity: info
            annotations:
              summary: "GPU utilization below 30%"
              description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} utilization is {{ $value }}%"
5 Common Pitfalls and Solutions
Pitfall 1: Sidecar Startup Order Causes Request Loss
Symptom: The business container starts before the Sidecar proxy, so the first inference requests fail because the Sidecar is not ready yet.
Solution: List the Sidecar first and give it a postStart hook that blocks until it is healthy (the kubelet starts containers in the order they are listed and waits for each postStart hook to finish), or configure the business container's readinessProbe to depend on the Sidecar's health check.
spec:
  containers:
    # Listed first so that its postStart hook completes before business-app starts.
    - name: sidecar-proxy
      lifecycle:
        postStart:
          exec:
            # -f makes curl exit non-zero on HTTP error statuses, so the loop waits for a healthy response.
            command: ["/bin/sh", "-c", "until curl -sf http://localhost:15001/health; do sleep 1; done"]
    - name: business-app
      readinessProbe:
        httpGet:
          path: /health
          port: 15001
        initialDelaySeconds: 5
        periodSeconds: 5
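The same wait-until-healthy idea can also live inside the business application as a startup gate. The sketch below is a minimal illustration using only the Python standard library; the function names are hypothetical, and the URL simply reuses the localhost:15001/health endpoint from the YAML above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=1.0):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_sidecar(probe, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    ready = wait_for_sidecar(lambda: http_ok("http://localhost:15001/health"))
    print("sidecar ready" if ready else "sidecar not ready, refusing to serve")
```

Unlike the shell loop in the postStart hook, this version stops after a bounded number of attempts, so a Sidecar that never comes up produces a clear failure instead of a Pod that hangs.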
Pitfall 2: iptables Rules Conflict with GPU Drivers
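Symptom: The Sidecar's iptables interception rules disrupt NVIDIA GPU driver communication, and model loading fails.
Solution: Exclude GPU communication ports and IP ranges from iptables interception.
metadata:
  annotations:
    # Example values; substitute the ranges and ports your GPU workloads actually use.
    # 10.96.0.0/12 is the kubeadm default Service CIDR, so excluding it bypasses
    # the Sidecar for all ClusterIP traffic in such clusters.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.96.0.0/12"
    traffic.sidecar.istio.io/excludeOutboundPorts: "50051,50052"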
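Before applying exclusions like these, it can help to check how much traffic they actually cover. The sketch below uses Python's ipaddress module to test whether a destination would skip interception; it assumes that an excluded range or an excluded port is each sufficient on its own, and the helper name is hypothetical.

```python
import ipaddress

EXCLUDED_RANGES = ["10.96.0.0/12"]  # from excludeOutboundIPRanges above
EXCLUDED_PORTS = {50051, 50052}     # from excludeOutboundPorts above


def bypasses_sidecar(dest_ip, dest_port):
    """Return True if outbound traffic to dest_ip:dest_port skips interception."""
    ip = ipaddress.ip_address(dest_ip)
    in_range = any(ip in ipaddress.ip_network(cidr) for cidr in EXCLUDED_RANGES)
    return in_range or dest_port in EXCLUDED_PORTS


for ip, port in [("10.100.5.9", 80), ("10.112.0.1", 80), ("192.168.1.1", 50051)]:
    print(ip, port, bypasses_sidecar(ip, port))
```

A /12 starting at 10.96.0.0 spans 10.96.0.0 through 10.111.255.255, which is far wider than a handful of GPU endpoints, so narrower ranges are usually the safer choice.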
Pitfall 3: Large Model Request Body Exceeds Envoy Buffer
Symptom: LLM inference prompts can be very long (tens of KB or more). When a request body exceeds the buffer limit configured in Envoy, requests are truncated or rejected with 413 errors.
Solution: Increase Envoy's request buffer size or configure streaming. The DestinationRule below covers the connection side only (HTTP/2 upgrade and connection reuse); it does not change a buffer limit, which in Istio is normally adjusted through an EnvoyFilter.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-dest
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
    tls:
      mode: ISTIO_MUTUAL
Pitfall 4: Sidecar Resource Contention Causes Inference Latency Spikes
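Symptom: The Sidecar proxy and the model service share the Pod, and contention for CPU and memory causes inference latency spikes.
Solution: Set independent resource limits for the Sidecar and use the CPU Manager static policy for CPU pinning. The static policy grants exclusive cores only to containers with integer CPU requests in a Guaranteed QoS Pod, which requires requests to equal limits for every container, the Sidecar included.
spec:
  runtimeClassName: nvidia
  containers:
    - name: sidecar-proxy
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
---
# Pod overhead is declared on the RuntimeClass, not in the Pod spec; the
# RuntimeClass admission controller copies it into each Pod.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # assumed handler name; use the one configured in your container runtime
overhead:
  podFixed:
    cpu: "200m"
    memory: "256Mi"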
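Whether CPU pinning can apply at all depends on the Pod's QoS class, so it is worth checking what class a given set of resources produces. The sketch below is a simplified re-implementation of the Kubernetes classification for cpu and memory only; it compares quantity strings literally (so "500m" and "0.5" would count as different), and the model container values are hypothetical.

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort from container resources."""
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        if limits or requests:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


sidecar = {"requests": {"cpu": "200m", "memory": "256Mi"},
           "limits": {"cpu": "500m", "memory": "512Mi"}}
model = {"requests": {"cpu": "4", "memory": "16Gi"},   # hypothetical model container
         "limits": {"cpu": "4", "memory": "16Gi"}}
print(qos_class([sidecar, model]))
```

With the Sidecar values from the YAML above the Pod comes out as Burstable, so no container in it receives exclusive cores. Setting the Sidecar's requests equal to its limits is what moves the Pod to Guaranteed.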
Pitfall 5: Connection Pool Exhaustion During Model Hot-Loading
Symptom: During model version switching, old connections are not closed and new connections cannot be created, which exhausts the connection pool.
Solution: Configure reasonable connection pool timeouts and idle connection recycling policies.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-conn-pool
  namespace: ai-serving
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10s
      http:
        # Closes upstream connections that have had no active requests for 60s.
        idleTimeout: 60s
        maxRequestsPerConnection: 50
        h2UpgradePolicy: DEFAULT
10 Common Error Troubleshooting
| # | Error Message | Cause | Solution |
|---|---|---|---|
| 1 | Sidecar proxy not ready | iptables rules not injected or Sidecar container not started | Check namespace label istio-injection=enabled, confirm Sidecar image pulled |
| 2 | upstream connect error or disconnect/reset before headers | Model service not ready or port mismatch | Check model container health, confirm port matches VirtualService |
| 3 | GPU out of memory | Model loading exceeds GPU memory | Lower gpu_memory_utilization, use quantized models, or add GPUs |
| 4 | connection refused to 127.0.0.1:15001 | Sidecar not listening on expected port | Check Sidecar config, confirm inbound port settings |
| 5 | request body too large | Request body exceeds Envoy buffer limit | Increase max_request_size or enable streaming |
| 6 | model not found in registry | Model name doesn't match routing config | Check model registration name and routing rule model field |
| 7 | circuit breaker open | Consecutive failures from the backend model service tripped the circuit breaker | Check model service health, adjust circuit breaker thresholds |
| 8 | timeout waiting for batch completion | Batch wait timeout | Increase maxWaitTimeMs or decrease maxBatchSize |
| 9 | CUDA error: no kernel image is available | CUDA binaries in the container were not built for the GPU's compute capability | Use a framework or image build that targets the GPU architecture, and confirm the NVIDIA driver supports the container's CUDA version |
| 10 | OOMKilled for sidecar container | Sidecar exceeded its memory limit and was killed by K8s | Increase Sidecar memory limit, check for memory leaks |
Advanced Optimization Techniques
1. Adaptive Batching Window
Dynamically adjust batching window size based on real-time load:
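class AdaptiveBatcher:
    """Adjusts the batching wait window from a smoothed request arrival rate."""

    def __init__(self, minWaitMs=10, maxWaitMs=100, targetBatchSize=16):
        self.minWaitMs = minWaitMs
        self.maxWaitMs = maxWaitMs
        self.targetBatchSize = targetBatchSize
        self.currentWaitMs = minWaitMs
        self._emaArrivalRate = 0.0  # requests per second, exponentially smoothed

    def updateWaitTime(self, queueSize: int, intervalMs: float):
        # queueSize is treated as the number of requests seen during intervalMs.
        if intervalMs > 0:
            arrivalRate = queueSize / (intervalMs / 1000.0)
            self._emaArrivalRate = 0.7 * self._emaArrivalRate + 0.3 * arrivalRate
        if self._emaArrivalRate > 0:
            # Time in ms needed to accumulate targetBatchSize requests at the current rate.
            optimalWait = (self.targetBatchSize / self._emaArrivalRate) * 1000
            self.currentWaitMs = max(self.minWaitMs, min(self.maxWaitMs, optimalWait))
        else:
            # No traffic observed yet: fall back to the longest window.
            self.currentWaitMs = self.maxWaitMs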
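A wait window is only useful once something consumes it. As a rough sketch of that consumer, the function below collects requests from a standard-library queue until either the batch is full or the window expires; the function name and parameters are hypothetical, and the value a batcher like the one above computes for currentWaitMs would be passed in as wait_ms.

```python
import queue
import time


def collect_batch(requests, max_batch_size, wait_ms):
    """Gather up to max_batch_size items, waiting at most wait_ms in total."""
    batch = []
    deadline = time.monotonic() + wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: ship whatever has been collected
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no further request arrived within the window
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=3, wait_ms=50))  # full batch, returns early
print(collect_batch(pending, max_batch_size=3, wait_ms=20))  # partial batch after the window
```

Under heavy load the batch fills before the deadline and latency stays low; under light load the deadline bounds how long a single request can be held back waiting for company.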
2. Model Warmup and Cold Start Optimization
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-warmup-config
  namespace: ai-serving
data:
  warmup.yaml: |
    models:
      - name: qwen-chat-v2.5
        warmupRequests:
          - prompt: "Hello, how are you?"
            maxTokens: 32
          - prompt: "Explain quantum computing in one sentence."
            maxTokens: 64
        warmupInterval: 300s
        maxWarmupRetries: 3
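How a proxy might act on this configuration can be sketched as a small retry loop. The example below is an illustration under stated assumptions: run_warmup and the send callback are hypothetical names, maxWarmupRetries is interpreted as the maximum number of attempts per request, and the actual inference call is injected so that the sketch stays independent of any model server API.

```python
def run_warmup(warmup_requests, send, max_retries=3):
    """Send each warmup request, trying up to max_retries times.

    Returns the prompts that still failed after all attempts.
    """
    failed = []
    for request in warmup_requests:
        for _ in range(max_retries):
            try:
                send(request["prompt"], request["maxTokens"])
                break  # this request warmed up successfully
            except Exception:
                continue  # try again until attempts run out
        else:
            # The loop finished without a successful send.
            failed.append(request["prompt"])
    return failed


def fake_send(prompt, max_tokens):
    """Stand-in for a real inference call; always succeeds."""
    return f"{len(prompt)} chars, up to {max_tokens} tokens"


warmup_requests = [
    {"prompt": "Hello, how are you?", "maxTokens": 32},
    {"prompt": "Explain quantum computing in one sentence.", "maxTokens": 64},
]
print(run_warmup(warmup_requests, fake_send))
```

Returning the failed prompts rather than raising lets the caller decide whether a partially warmed model should still be marked ready.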
3. Inference Result Caching
Cache inference results for identical prompts to avoid redundant computation:
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-cache-config
  namespace: ai-serving
data:
  cache.yaml: |
    enabled: true
    backend: redis
    redis:
      endpoint: redis://redis-cluster:6379
      ttl: 3600
      maxMemory: 2gb
    keyStrategy: prompt_hash
    cacheableModels:
      - qwen-chat-v2.5
      - bge-embedding-v1.5
    hitRateThreshold: 0.3
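One plausible reading of the prompt_hash key strategy is a digest over the model name, the prompt, and the generation parameters. The sketch below illustrates that reading with an in-memory dictionary standing in for Redis; the key prefix, the class, and the assumption that ttl is expressed in seconds are all illustrative choices, not taken from the configuration format above.

```python
import hashlib
import json
import time


def prompt_cache_key(model, prompt, params=None):
    """Derive a stable cache key from model, prompt, and generation parameters."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params or {}},
        sort_keys=True,  # key order must not change the hash
    )
    return "infer:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_s=3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl_s)


cache = TTLCache(ttl_s=3600)
key = prompt_cache_key("qwen-chat-v2.5", "Hello, how are you?", {"maxTokens": 32})
if cache.get(key) is None:
    cache.set(key, "cached completion text")
print(cache.get(key))
```

Including the generation parameters in the key keeps a request with maxTokens 32 from being served a completion produced with maxTokens 64. Caching of this kind is mainly appropriate for deterministic outputs such as embeddings or greedy decoding.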
4. Request Priority and Preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 1000000
globalDefault: false
description: "Real-time inference requests with strict latency SLA"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 10000
globalDefault: false
description: "Batch inference with no latency SLA"
preemptionPolicy: Never
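PriorityClass objects act on Pods at scheduling time, so ordering individual requests inside the proxy needs its own mechanism. The sketch below shows one possible approach with a heap-based queue; the class is hypothetical, and reusing the two PriorityClass names and values as request priorities is an assumption made only to keep the example aligned with the YAML above.

```python
import heapq
import itertools

# Assumed mapping: request priorities mirror the PriorityClass values above.
PRIORITY = {"realtime-inference": 1000000, "batch-inference": 10000}


class PriorityRequestQueue:
    """Serves higher-priority requests first, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()

    def push(self, request, priority_class):
        # heapq is a min-heap, so the priority is negated; the sequence number
        # breaks ties and preserves arrival order within a class.
        entry = (-PRIORITY[priority_class], next(self._sequence), request)
        heapq.heappush(self._heap, entry)

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


pending = PriorityRequestQueue()
pending.push("nightly-report", "batch-inference")
pending.push("chat-turn-1", "realtime-inference")
pending.push("bulk-embedding", "batch-inference")
pending.push("chat-turn-2", "realtime-inference")
while pending:
    print(pending.pop())
```

This queue only reorders waiting requests; it does not interrupt a batch that is already running on the GPU, which would require cooperation from the model server.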
Comparison: Sidecar Proxy vs Service Mesh vs Gateway
| Dimension | Sidecar Proxy | Service Mesh (Istio) | API Gateway |
|---|---|---|---|
| Deployment | In Pod, co-located with business container | In Pod, mesh-wide coverage | Cluster ingress, standalone |
| Traffic interception | iptables/eBPF | iptables/ztunnel | DNS/virtual IP |
| Model routing | Custom logic, flexible | VirtualService, declarative | Route rules, limited |
| Batching | Native support, deeply customizable | Not supported | Not supported |
| GPU awareness | Can sense GPU resources | Not aware | Not aware |
| Performance overhead | Low (5-15ms) | Medium (10-30ms) | Low (2-5ms) |
| Observability | Custom metrics | Full mesh metrics | Ingress metrics |
| Ops complexity | Medium | High | Low |
| Use case | AI inference-specific proxy | Cluster-wide service communication | External traffic ingress |
| Learning curve | Medium | High | Low |
Recommended strategy: Use a dedicated Sidecar proxy for AI inference model routing and batching, a Service Mesh for inter-service communication, and an API Gateway for external ingress. Each of the three layers keeps its own responsibility and does not interfere with the others.
Recommended Online Tools
- YAML/JSON Formatter: /en/json/format — Format K8s YAML configurations
- Base64 Encode/Decode: /en/encode/base64 — Handle certificates and keys in Secrets
- curl to Code: /en/dev/curl-to-code — Quickly generate API test code
Related Reading
- K8s Gateway API Service Mesh Traffic Management Migration Guide — Deep dive into Gateway API in service mesh
- Python AI Model Production Deployment — Complete deployment from development to production
- K8s HPA Autoscaling Production Practices — Autoscaling strategies for AI inference services
Summary
K8s Sidecar AI inference proxies have become the standard architecture pattern for AI inference deployment in 2026. The 7 production patterns cover the complete chain from traffic interception to observability: Envoy traffic interception achieves zero business code changes, smart model routing dynamically selects models based on request characteristics, A/B testing and canary deployment ensure safe model rollouts, multi-model serving enables version coexistence, GPU resource pooling increases utilization from 30% to 80%+, batch merging boosts throughput 5-8x, and OpenTelemetry full-chain tracing makes performance bottlenecks visible. The core principle: Sidecar proxies focus on inference logic, business containers focus on business logic, communicating via localhost with zero coupling and zero intrusion.