Cloud-Native AI Deployment: The Complete Guide (Docker+K8s+GPU Scheduling)

技术架构

In 2026, AI Not on the Cloud Means AI Not in Production

Model training happens in the lab, inference serving runs in the cloud — this is the iron law of AI engineering.

A brutal reality: your 7B model runs blazingly fast on your laptop, but collapses in production due to misconfigured GPU scheduling, container errors, and broken autoscaling. AI deployment is not just writing a Dockerfile.

Cloud-Native AI Deployment Landscape

┌─────────────────────────────────────────────────────────────────────┐
│              Cloud-Native AI Deployment Architecture                 │
├─────────────┬─────────────┬──────────────┬─────────────────────────┤
│ Model Serve │ Vector Svc  │ Microservices│   Infrastructure        │
│             │             │              │                         │
│ vLLM/Triton │ Milvus      │ Spring Boot  │ K8s + GPU Operator     │
│ TGI/Ollama  │ Elastic-    │ AI Gateway   │ NVIDIA MIG + Time Slicing│
│ Model Router│ search      │ Ingress      │ Prometheus + Grafana    │
│ HPA Scale   │ Embedding   │ Service Mesh │ ArgoCD + Kustomize     │
├─────────────┴─────────────┴──────────────┴─────────────────────────┤
│                  Docker Containerization + K8s Orchestration        │
│       Multi-stage Build · GPU-Aware Scheduling · Elastic Scale     │
│                        · GitOps Delivery                            │
└─────────────────────────────────────────────────────────────────────┘

Three Challenges of AI Deployment

Challenge 1: GPU Scarcity — Compute is Hard Currency

┌──────────────────────────────────────────────────┐
│           GPU Resource Supply-Demand Gap          │
│                                                    │
│   Need: 100 inference services × 1 GPU = 100 GPU  │
│   Reality: Cluster only has 8 × A100 = 8 GPU      │
│   Gap: 92 GPU ❌                                   │
│                                                    │
│   ┌──────┐  MIG Slice   ┌──────────────┐          │
│   │ A100 │ ────────→    │ 7 × MIG      │          │
│   │ 80GB │  Time Slice  │ + Time Reuse  │          │
│   └──────┘ ────────→    │ = 14+ vGPU   │          │
│                         └──────────────┘          │
└──────────────────────────────────────────────────┘
GPU Model VRAM MIG Slices Suitable Model Size Monthly Cost
A100 80GB 80GB 7 × 10GB 7B-13B $2,100
A100 40GB 40GB 4 × 10GB 7B $1,400
H100 80GB 80GB 7 × 10GB 7B-70B $3,500
L40S 48GB 48GB No MIG 7B-13B $850
T4 16GB 16GB No MIG 1B-3B $280

Challenge 2: Model Size Explosion — Storage & Transfer Nightmare

Model Parameters FP16 Size INT8 Size INT4 Size
Qwen2.5-7B 7B 14GB 7GB 3.5GB
Llama3.1-13B 13B 26GB 13GB 6.5GB
Qwen2.5-72B 72B 144GB 72GB 36GB
DeepSeek-V3-671B 671B 1.3TB 671GB 335GB

Model pull time: Pulling a 72B FP16 model from Docker Registry takes 30+ minutes; with INT4 quantization, only 8 minutes.

Challenge 3: Latency vs Throughput — Can't Have Both

┌─────────────────────────────────────────────────────┐
│           Latency vs Throughput Trade-off Curve      │
│                                                       │
│  Throughput │         ╱                               │
│  (req/sec)  │        ╱                                │
│             │       ╱    ← Larger batch, more throughput│
│             │      ╱                                  │
│             │     ╱                                   │
│             │    ╱                                    │
│             │───╱──────────────────────               │
│             │  ╱    ← But latency increases too!       │
│             └─────────────────────── Latency(ms)      │
│                                                       │
│  Optimal: Dynamic Batching + Continuous Batching      │
└─────────────────────────────────────────────────────┘
Strategy Latency Throughput Use Case
Single Request Serial Lowest Lowest Real-time Chat
Static Batching Medium Medium Offline Inference
Continuous Batching Low High Online Serving
Speculative Decoding Lower High Real-time + High Throughput

AI Model Serving: Four Frameworks Compared

vLLM vs Triton vs TGI vs Ollama

Dimension vLLM Triton Inference Server TGI (Text Generation Inference) Ollama
Developer UC Berkeley NVIDIA HuggingFace Community
Core Strength PagedAttention Multi-framework Flash Attention Simplicity
Continuous Batching ✅ Native ✅ Dynamic Batching ✅ Native
Tensor Parallel
Streaming Output ✅ SSE ✅ SSE
OpenAI-Compatible API Needs Adapter
Quantization Support AWQ/GPTQ/GGUF INT8/FP8 AWQ/GPTQ/bitsandbytes GGUF
GPU Utilization 90%+ 80%+ 85%+ 60%+
K8s Friendliness ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐
Production Ready ⚠️ Dev/Test
Multi-Model Serving Single per instance Multiple per instance Single per instance Multiple per instance
Hot Model Loading

Architecture Comparison

┌─────────────────────────────────────────────────────────────────┐
│                      vLLM Architecture                           │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Request  │───→│ Scheduler    │───→│ PagedAttention  │        │
│  │ Queue    │    │ (Continuous  │    │ (KV Block Mgmt) │        │
│  │          │    │  Batching)   │    │                 │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    OpenAI API    Dynamic Batch         GPU HBM                  │
│    Compatible    Size Control          KV Cache                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Triton Architecture                           │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Model    │───→│ Dynamic      │───→│ Backend         │        │
│  │ Repo     │    │ Batcher     │    │ (PyTorch/TF/    │        │
│  │          │    │              │    │  ONRT/TensorRT) │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    Multi-model    Priority Queue    Multi-framework Engine      │
└─────────────────────────────────────────────────────────────────┘

vLLM Quick Start

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    quantization="awq",
)

params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["Explain the core challenges of cloud-native AI deployment"], params)
for output in outputs:
    print(output.outputs[0].text)

TGI Launch Example

docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --quantize awq \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --cuda-memory-fraction 0.9

Selection Guide

┌──────────────────────────────────────────────────────┐
│         Model Serving Framework Decision Tree         │
│                                                        │
│  Need maximum GPU utilization?                         │
│    ├─ Yes → vLLM (PagedAttention, 90%+ utilization)   │
│    └─ No ↓                                             │
│  Need multi-framework/multi-model coexistence?         │
│    ├─ Yes → Triton (multiple models per instance)      │
│    └─ No ↓                                             │
│  Need seamless HuggingFace ecosystem integration?      │
│    ├─ Yes → TGI (Flash Attention + HF Hub)             │
│    └─ No ↓                                             │
│  Just local dev/testing?                               │
│    └─ Yes → Ollama (one command to start)              │
└──────────────────────────────────────────────────────┘

Dockerizing AI Applications: Multi-Stage Build + Compose

Multi-Stage Dockerfile

# ===== Stage 1: Model Download =====
FROM python:3.11-slim AS model-downloader

RUN pip install --no-cache-dir huggingface-hub

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ARG QUANTIZATION=awq

RUN huggingface-cli download \
    ${MODEL_ID} \
    --local-dir /models/${MODEL_ID} \
    --exclude "*.safetensors" \
    && echo "Model config downloaded"

# ===== Stage 2: Inference Runtime =====
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3-pip python3.11-venv \
    && rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/vllm
ENV PATH="/opt/vllm/bin:${PATH}"

RUN pip install --no-cache-dir \
    vllm==0.8.0 \
    transformers>=4.45.0 \
    accelerate

COPY --from=model-downloader /models /models

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ENV MODEL_ID=${MODEL_ID}

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/Qwen/Qwen2.5-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "8192"]

Docker Compose Local Dev Environment

version: "3.8"

services:
  vllm-server:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        MODEL_ID: Qwen/Qwen2.5-7B-Instruct
        QUANTIZATION: awq
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - model-cache:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  milvus-standalone:
    image: milvusdb/milvus:v2.4.17
    ports:
      - "19530:19530"
      - "9091:9091"
    environment:
      - ETCD_USE_EMBED=true
      - COMMON_STORAGETYPE=local
    volumes:
      - milvus-data:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 15s
      timeout: 5s
      retries: 3

  elasticsearch:
    image: elasticsearch:8.15.0
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health"]
      interval: 15s
      timeout: 5s
      retries: 3

  springboot-ai:
    build:
      context: ../springboot-ai-service
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - SPRING_AI_OPENAI_BASE-URL=http://vllm-server:8000/v1
      - SPRING_AI_OPENAI_API-KEY=not-needed
      - MILVUS_HOST=milvus-standalone
      - MILVUS_PORT=19530
      - ELASTICSEARCH_URIS=http://elasticsearch:9200
    depends_on:
      vllm-server:
        condition: service_healthy
      milvus-standalone:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy

volumes:
  model-cache:
  milvus-data:
  es-data:

Docker Image Optimization Tips

Technique Effect Example
Multi-stage build Image size -60%+ Build tools excluded from runtime
Model quantization Size -50% to -75% FP16→INT4
.dockerignore Build context reduction Exclude .git, pycache
Layer cache reuse Build time -80% Unchanged dependency layers reused
Base image selection 10x size difference slim vs full
External model volume Image excludes model data Mount at runtime

K8s GPU Scheduling: NVIDIA GPU Operator + MIG + Time Slicing

NVIDIA GPU Operator Installation

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |
    driver:
      enabled: true
      version: "550.54.15"
    toolkit:
      enabled: true
      version: "v1.15.0"
    devicePlugin:
      enabled: true
      config:
        name: mig-config
        shared:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4
    migManager:
      enabled: true
      config:
        name: mig-partition-config

MIG Partition Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-partition-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 14
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 2
            "1g.10gb": 3
        - devices: [1]
          mig-enabled: false

GPU Time Slicing Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-80gb.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
            devices: all
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
          args:
            - --config-file=/etc/nvidia-plugin/config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/nvidia-plugin
      volumes:
        - name: config
          configMap:
            name: time-slicing-config

Pod Using MIG Instance

apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-pod
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.8.0
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
      command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - Qwen/Qwen2.5-7B-Instruct-AWQ
        - --max-model-len
        - "4096"
        - --gpu-memory-utilization
        - "0.9"

GPU Scheduling Strategy Comparison

Strategy Principle Pros Cons Use Case
Exclusive GPU 1 Pod = 1 GPU Zero interference, stable perf Resource waste Large model inference
MIG Partition Hardware-level isolation True isolation, configurable sizes A100/H100 only Multi small model serving
Time Slicing Software-level multiplexing Works on all GPUs Context switch overhead Low-priority tasks
MPS Shared GPU context Low-overhead sharing No isolation, one crash affects all Same-team multi-task

Model Inference Service Deployment: vLLM + HPA Autoscaling

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b
  namespace: ai-inference
  labels:
    app: vllm-qwen7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b
  template:
    metadata:
      labels:
        app: vllm-qwen7b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: vllm-qwen7b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
            requests:
              nvidia.com/gpu: 2
              cpu: "4"
              memory: 16Gi
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-7B-Instruct-AWQ"
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - Qwen/Qwen2.5-7B-Instruct-AWQ
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "2"
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "8192"
            - --enable-prefix-caching
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      nodeSelector:
        gpu-type: a100-80gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

HPA Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-qwen7b-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "10"
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "70"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Service + PodMonitor

apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen7b-svc
  namespace: ai-inference
spec:
  selector:
    app: vllm-qwen7b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-qwen7b-monitor
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vllm-qwen7b
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 15s

Key vLLM Monitoring Metrics

Metric Description Alert Threshold
vllm:num_requests_running Active requests being processed > 80% max_num_seqs
vllm:num_requests_waiting Requests in waiting queue > 10 for 5 minutes
vllm:gpu_cache_usage_perc KV Cache utilization > 90%
vllm:avg_generation_throughput Average generation throughput < 100 tok/s
vllm:e2e_request_latency_seconds End-to-end request latency P99 > 10s
vllm:num_preemptions Preemption count > 0

RAG Vector Service K8s Deployment: Milvus Cluster + Elasticsearch

Overall Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    RAG Vector Service Architecture                │
│                                                                    │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐        │
│  │  User     │───→│ Spring Boot  │───→│ vLLM Embedding   │        │
│  │  Query    │    │ AI Gateway   │    │ (Vector Gen)     │        │
│  └──────────┘    └──────┬───────┘    └──────────────────┘        │
│                         │                                        │
│              ┌──────────┼──────────┐                              │
│              ▼                     ▼                              │
│     ┌──────────────┐    ┌──────────────┐                         │
│     │   Milvus     │    │Elasticsearch │                         │
│     │  (Vector     │    │ (Full-text   │                         │
│     │   Search)    │    │  Search)     │                         │
│     │  HNSW/IVF   │    │  BM25+Dense  │                         │
│     └──────────────┘    └──────────────┘                         │
│            │                    │                                │
│            └────────┬───────────┘                                │
│                     ▼                                            │
│              ┌──────────────┐    ┌──────────────────┐           │
│              │  Hybrid      │───→│ vLLM LLM         │           │
│              │  Results     │    │ (Answer Gen)     │           │
│              │  Rerank      │    │                  │           │
│              └──────────────┘    └──────────────────┘           │
└──────────────────────────────────────────────────────────────────┘

Milvus Cluster Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: milvus
  namespace: ai-rag
spec:
  serviceName: milvus-headless
  replicas: 3
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.17
          ports:
            - containerPort: 19530
            - containerPort: 9091
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
          env:
            - name: ETCD_ENDPOINTS
              value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
            - name: MINIO_ADDRESS
              value: "minio:9000"
            - name: COMMON_STORAGETYPE
              value: "remote"
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 15
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: milvus-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-svc
  namespace: ai-rag
spec:
  selector:
    app: milvus
  ports:
    - name: grpc
      port: 19530
      targetPort: 19530
    - name: metrics
      port: 9091
      targetPort: 9091
  type: ClusterIP

Elasticsearch Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: ai-rag
spec:
  serviceName: elasticsearch-headless
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        - name: sysctl
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: elasticsearch:8.15.0
          ports:
            - containerPort: 9200
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: cluster.name
              value: "ai-rag-cluster"
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch-headless,elasticsearch-1.elasticsearch-headless,elasticsearch-2.elasticsearch-headless"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: xpack.security.enabled
              value: "false"
            - name: ES_JAVA_OPTS
              value: "-Xms4g -Xmx4g"
          volumeMounts:
            - name: es-data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: es-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi

Milvus Collection Creation

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://milvus-svc.ai-rag.svc:19530")

schema = client.create_schema(auto_id=True, enable_dynamic_field=True)

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="timestamp", datatype=DataType.INT64)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 32, "efConstruction": 256}
)
index_params.add_index(
    field_name="text",
    index_type="INVERTED"
)

client.create_collection(
    collection_name="knowledge_base",
    schema=schema,
    index_params=index_params,
)
Search Type Principle Advantage Disadvantage Use Case
Vector Search Semantic similarity Understands semantics Weak exact match Semantic Q&A
Full-text Search BM25 keywords Exact matching No semantic understanding Keyword search
Hybrid Search Vector + full-text fusion Best of both worlds Higher compute cost Production RAG
Rerank Cross-encoder reranking Highest accuracy Slower High-precision scenarios

Spring Boot AI Microservices on K8s

Spring Boot AI Application Dockerfile

FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY gradle/ gradle/
COPY gradlew build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon
COPY src/ src/
RUN ./gradlew bootJar --no-daemon -x test

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar

ENV JAVA_OPTS="-XX:+UseZGC -XX:MaxRAMPercentage=75.0"
ENV SPRING_PROFILES_ACTIVE=k8s

EXPOSE 8080

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -qO- http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["sh", "-c", "java ${JAVA_OPTS} -jar app.jar"]

K8s Deployment + Service + Ingress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: springboot-ai-service
  namespace: ai-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: springboot-ai-service
  template:
    metadata:
      labels:
        app: springboot-ai-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
        - name: app
          image: registry.example.com/springboot-ai-service:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          env:
            - name: SPRING_AI_OPENAI_BASE-URL
              value: "http://vllm-qwen7b-svc.ai-inference.svc:8000/v1"
            - name: SPRING_AI_OPENAI_API-KEY
              valueFrom:
                secretKeyRef:
                  name: ai-api-keys
                  key: openai-key
            - name: MILVUS_HOST
              value: "milvus-svc.ai-rag.svc"
            - name: MILVUS_PORT
              value: "19530"
            - name: ELASTICSEARCH_URIS
              value: "http://elasticsearch-svc.ai-rag.svc:9200"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config
      volumes:
        - name: config
          configMap:
            name: springboot-ai-config
---
apiVersion: v1
kind: Service
metadata:
  name: springboot-ai-svc
  namespace: ai-app
spec:
  selector:
    app: springboot-ai-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: springboot-ai-ingress
  namespace: ai-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-AI-Service: springboot-ai";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ai-api.example.com
      secretName: ai-api-tls
  rules:
    - host: ai-api.example.com
      http:
        paths:
          - path: /api/ai
            pathType: Prefix
            backend:
              service:
                name: springboot-ai-svc
                port:
                  number: 80

Spring Boot AI Core Configuration

spring:
  ai:
    openai:
      base-url: http://vllm-qwen7b-svc.ai-inference.svc:8000/v1
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: Qwen/Qwen2.5-7B-Instruct-AWQ
          temperature: 0.7
          max-tokens: 2048
    vectorstore:
      milvus:
        client:
          host: ${MILVUS_HOST:milvus-svc.ai-rag.svc}
          port: ${MILVUS_PORT:19530}
        database-name: default
        collection-name: knowledge_base
        embedding-dimension: 1024
        metric-type: COSINE
        index-type: HNSW

  elasticsearch:
    uris: ${ELASTICSEARCH_URIS:http://elasticsearch-svc.ai-rag.svc:9200}

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: springboot-ai-service
  tracing:
    sampling:
      probability: 1.0
  otlp:
    metrics:
      export:
        url: http://otel-collector.observability.svc:4318/v1/metrics
    tracing:
      endpoint: http://otel-collector.observability.svc:4318/v1/traces

Observability: Prometheus + Grafana + OpenTelemetry

Monitoring Architecture

┌──────────────────────────────────────────────────────────────────┐
│               Full-Stack Observability Architecture               │
│                                                                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ vLLM     │  │ Milvus   │  │ES        │  │Spring    │        │
│  │ /metrics │  │ /metrics │  │/_prom    │  │/actuator │        │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│       │              │              │              │              │
│       └──────────────┴──────────────┴──────────────┘              │
│                            │                                      │
│                   ┌────────▼────────┐                             │
│                   │  OTel Collector │                             │
│                   │ (Receive+Process│                             │
│                   │  +Export)       │                             │
│                   └────────┬────────┘                             │
│              ┌─────────────┼─────────────┐                        │
│              ▼             ▼             ▼                        │
│     ┌──────────────┐ ┌──────────┐ ┌──────────┐                  │
│     │  Prometheus   │ │  Jaeger  │ │  Loki    │                  │
│     │  (Metrics)    │ │ (Traces) │ │ (Logs)   │                  │
│     └──────┬───────┘ └──────────┘ └──────────┘                  │
│            ▼                                                      │
│     ┌──────────────┐                                              │
│     │   Grafana    │                                              │
│     │  (Dashboard) │                                              │
│     └──────────────┘                                              │
└──────────────────────────────────────────────────────────────────┘

OpenTelemetry Collector Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          ports:
            - containerPort: 4318
            - containerPort: 4317
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
      prometheus:
        config:
          scrape_configs:
            - job_name: "vllm"
              scrape_interval: 15s
              static_configs:
                - targets: ["vllm-qwen7b-svc.ai-inference.svc:8000"]
            - job_name: "milvus"
              scrape_interval: 15s
              static_configs:
                - targets: ["milvus-svc.ai-rag.svc:9091"]

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400

    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
      otlphttp:
        endpoint: http://jaeger-collector.observability.svc:4318
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push

    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlphttp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]

Prometheus Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: observability
spec:
  groups:
    - name: vllm-alerts
      rules:
        - alert: VLLMHighQueueDepth
          expr: vllm_num_requests_waiting > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue depth too high"
            description: "Instance {{ $labels.instance }} has {{ $value }} waiting requests"

        - alert: VLLMHighLatency
          expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 10
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "vLLM P99 latency too high"
            description: "Instance {{ $labels.instance }} P99 latency {{ $value }}s"

        - alert: VLLMGPUCacheFull
          expr: vllm_gpu_cache_usage_perc > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM GPU cache utilization too high"
            description: "Instance {{ $labels.instance }} cache usage {{ $value }}%"

        - alert: VLLMPreemptions
          expr: increase(vllm_num_preemptions[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "vLLM request preemptions detected"
            description: "Instance {{ $labels.instance }} had {{ $value }} preemptions in 5m"

    - name: milvus-alerts
      rules:
        - alert: MilvusHighQueryLatency
          expr: histogram_quantile(0.99, rate(milvus_proxy_sq_latency_bucket[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Milvus query latency too high"

        - alert: MilvusInsertRateLow
          expr: rate(milvus_proxy_insert_vectors_count[5m]) < 10
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Milvus insert rate abnormally low"

Key Grafana Dashboard Panels

Panel Metric Visualization
Inference QPS rate(vllm_num_requests_total[1m]) Line chart
P50/P90/P99 Latency histogram_quantile Line chart
GPU Cache Usage vllm_gpu_cache_usage_perc Gauge
Request Queue Depth vllm_num_requests_waiting Bar chart
Vector Search Latency milvus_proxy_sq_latency Line chart
Spring Boot JVM jvm_memory_used_bytes Area chart
HTTP Error Rate rate(http_server_requests_seconds_count{status=~"5.."}[5m]) Line chart
Distributed Traces Jaeger integration Table + link

Cost Optimization: Quantization + MIG Reuse + Dynamic Scaling + Spot Instances

Model Quantization Comparison

Method Accuracy Loss Size Reduction Inference Speedup Compatibility
FP16 (baseline) 0% 1x 1x All
INT8 (PTQ) <1% 2x 1.5x A100/H100
INT4 (GPTQ) 1-3% 4x 1.8x Needs calibration data
INT4 (AWQ) 1-2% 4x 2x Needs calibration data
INT4 (GGUF) 2-5% 4x 1.5x CPU/GPU universal
FP8 <0.5% 2x 2x+ H100 native

Cost Optimization Strategy Overview

┌──────────────────────────────────────────────────────────────────┐
│              Four Pillars of Cost Optimization                    │
│                                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────┐ │
│  │ 1.Quantize   │  │ 2.MIG Reuse  │  │ 3.Dynamic    │  │4.Spot│ │
│  │              │  │              │  │    Scaling    │  │Inst. │ │
│  │ FP16→INT4   │  │ 1GPU→7MIG   │  │ HPA+VPA     │  │Bid   │ │
│  │ Size↓75%    │  │ Util↑7x     │  │ Idle scale   │  │Cost  │ │
│  │ Speed↑2x    │  │ Cost↓85%    │  │ Cost↓40%    │  │↓70%  │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────┘ │
│                                                                    │
│  Combined effect: Cost reduced to 5%-15% of baseline              │
└──────────────────────────────────────────────────────────────────┘

Cost Calculation Example

Plan GPU Config Monthly Cost Notes
Baseline: FP16 Exclusive 4 × A100 80GB $8,400 2 replicas × 2 GPU
INT4 Quantized 2 × A100 80GB $4,200 2 replicas × 1 GPU
INT4 + MIG 1 × A100 80GB $2,100 1 replica × 1 MIG instance
INT4 + MIG + HPA 1 × A100 80GB (idle 0.5) $1,400 Scale to 1 replica when idle
INT4 + MIG + HPA + Spot 1 × A100 Spot $420 Spot discount 70%

Spot Instance Strategy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b-spot
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b-spot
  template:
    metadata:
      labels:
        app: vllm-qwen7b-spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: cloud.google.com/gke-preemptible
                    operator: In
                    values: ["true"]
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          env:
            - name: GRACEFUL_SHUTDOWN
              value: "true"
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]

Quantized Model Loading Configuration

from vllm import LLM

llm_awq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    enforce_eager=True,
)

llm_gptq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    quantization="gptq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

GitOps: ArgoCD + Kustomize Automated Delivery

GitOps Workflow

┌──────────────────────────────────────────────────────────────────┐
│                    GitOps Delivery Pipeline                       │
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ Developer│───→│ CI Build │───→│ Image    │───→│ Git Repo │  │
│  │ Commit   │    │ & Test   │    │ Push     │    │ Manifest │  │
│  └──────────┘    └──────────┘    └──────────┘    └────┬─────┘  │
│                                                       │          │
│                                              ┌────────▼───────┐  │
│                                              │   ArgoCD       │  │
│                                              │  Detect Change │  │
│                                              │  Auto Sync     │  │
│                                              └────────┬───────┘  │
│                                                       │          │
│                                    ┌──────────────────┼──────┐   │
│                                    ▼                  ▼      ▼   │
│                              ┌──────────┐   ┌──────────┐  ┌──┐ │
│                              │  Dev     │   │  Staging  │  │PR││
│                              │  Env     │   │  Env      │  │OD│ │
│                              └──────────┘   └──────────┘  └──┘ │
└──────────────────────────────────────────────────────────────────┘

Kustomize Directory Structure

k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── ingress.yaml
├── overlays/
│   ├── dev/
│   │   ├── kustomization.yaml
│   │   └── patch-replicas.yaml
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patch-resources.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── patch-replicas.yaml
│       ├── patch-resources.yaml
│       └── patch-hpa.yaml

Base Kustomization

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml

commonLabels:
  app.kubernetes.io/part-of: ai-inference-platform
  app.kubernetes.io/managed-by: kustomize

images:
  - name: vllm-server
    newName: registry.example.com/vllm-openai
    newTag: v0.8.0

configMapGenerator:
  - name: vllm-config
    literals:
      - MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-AWQ
      - MAX_MODEL_LEN=8192
      - GPU_MEMORY_UTILIZATION=0.9

secretGenerator:
  - name: ai-api-keys
    literals:
      - openai-key=placeholder

Production Overlay

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ai-inference-production

resources:
  - ../../base

patches:
  - target:
      kind: Deployment
      name: vllm-qwen7b
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-qwen7b
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: vllm
                resources:
                  limits:
                    nvidia.com/gpu: 2
                  requests:
                    nvidia.com/gpu: 2
                    cpu: "4"
                    memory: 16Gi

  - target:
      kind: HorizontalPodAutoscaler
      name: vllm-qwen7b-hpa
    patch: |
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: vllm-qwen7b-hpa
      spec:
        minReplicas: 4
        maxReplicas: 16

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-inference-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: ai-platform-deploys
spec:
  project: ai-platform
  source:
    repoURL: https://git.example.com/ai-platform/k8s-manifests.git
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m

ArgoCD App of Apps Pattern

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ai-platform/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ai-platform
  namespace: argocd
spec:
  description: AI Inference Platform Project
  sourceRepos:
    - https://git.example.com/ai-platform/*
  destinations:
    - namespace: ai-inference-*
      server: https://kubernetes.default.svc
    - namespace: ai-rag-*
      server: https://kubernetes.default.svc
    - namespace: ai-app-*
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota

Summary: Cloud-Native AI Deployment Checklist

Phase Checklist Item Status
Model Serving vLLM/Triton selection finalized
Model Serving Quantization strategy (INT4/AWQ) decided
Docker Multi-stage Dockerfile written
Docker Docker Compose local validation
K8s GPU GPU Operator installed
K8s GPU MIG/Time Slicing configured
Inference vLLM Deployment deployed
Inference HPA autoscaling configured
RAG Service Milvus cluster deployed
RAG Service Elasticsearch deployed
Microservices Spring Boot AI on K8s
Microservices Ingress routing configured
Observability Prometheus + Grafana
Observability OpenTelemetry distributed tracing
Observability Alert rules configured
Cost Optimization Model quantization validated
Cost Optimization Spot instance strategy
GitOps ArgoCD + Kustomize
GitOps Multi-environment Overlay config

Cloud-native AI deployment is not the destination, but the starting point. From Docker containerization to K8s orchestration, from GPU scheduling to observability — each step is engineering accumulation. Remember: deployment without monitoring is flying blind; operations without automation is manual labor. Automate everything with GitOps, and make AI inference services true first-class citizens of cloud-native.

Try these browser-local tools — no sign-up required →

#Kubernetes#Docker#GPU调度#vLLM#云原生#AI部署#ArgoCD