Cloud-Native AI Deployment: The Complete Guide (Docker+K8s+GPU Scheduling)

In 2026, AI Not on the Cloud Means AI Not in Production

Model training happens in the lab, inference serving runs in the cloud — this is the iron law of AI engineering.

A brutal reality: your 7B model runs blazingly fast on your laptop, but collapses in production due to misconfigured GPU scheduling, container errors, and broken autoscaling. AI deployment is not just writing a Dockerfile.

Cloud-Native AI Deployment Landscape

┌─────────────────────────────────────────────────────────────────────┐
│              Cloud-Native AI Deployment Architecture                 │
├─────────────┬─────────────┬──────────────┬─────────────────────────┤
│ Model Serve │ Vector Svc  │ Microservices│   Infrastructure        │
│             │             │              │                         │
│ vLLM/Triton │ Milvus      │ Spring Boot  │ K8s + GPU Operator     │
│ TGI/Ollama  │ Elastic-    │ AI Gateway   │ NVIDIA MIG + Time Slicing│
│ Model Router│ search      │ Ingress      │ Prometheus + Grafana    │
│ HPA Scale   │ Embedding   │ Service Mesh │ ArgoCD + Kustomize     │
├─────────────┴─────────────┴──────────────┴─────────────────────────┤
│                  Docker Containerization + K8s Orchestration        │
│       Multi-stage Build · GPU-Aware Scheduling · Elastic Scale     │
│                        · GitOps Delivery                            │
└─────────────────────────────────────────────────────────────────────┘

Three Challenges of AI Deployment

Challenge 1: GPU Scarcity — Compute is Hard Currency

┌──────────────────────────────────────────────────┐
│           GPU Resource Supply-Demand Gap          │
│                                                    │
│   Need: 100 inference services × 1 GPU = 100 GPU  │
│   Reality: Cluster only has 8 × A100 = 8 GPU      │
│   Gap: 92 GPU ❌                                   │
│                                                    │
│   ┌──────┐  MIG Slice   ┌──────────────┐          │
│   │ A100 │ ────────→    │ 7 × MIG      │          │
│   │ 80GB │  Time Slice  │ + Time Reuse  │          │
│   └──────┘ ────────→    │ = 14+ vGPU   │          │
│                         └──────────────┘          │
└──────────────────────────────────────────────────┘

GPU Model	VRAM	MIG Slices	Suitable Model Size	Monthly Cost
A100 80GB	80GB	7 × 10GB	7B-13B	$2,100
A100 40GB	40GB	4 × 10GB	7B	$1,400
H100 80GB	80GB	7 × 10GB	7B-70B	$3,500
L40S 48GB	48GB	No MIG	7B-13B	$850
T4 16GB	16GB	No MIG	1B-3B	$280

Challenge 2: Model Size Explosion — Storage & Transfer Nightmare

Model	Parameters	FP16 Size	INT8 Size	INT4 Size
Qwen2.5-7B	7B	14GB	7GB	3.5GB
Llama3.1-13B	13B	26GB	13GB	6.5GB
Qwen2.5-72B	72B	144GB	72GB	36GB
DeepSeek-V3-671B	671B	1.3TB	671GB	335GB

Model pull time: Pulling a 72B FP16 model from Docker Registry takes 30+ minutes; with INT4 quantization, only 8 minutes.

Challenge 3: Latency vs Throughput — Can't Have Both

┌─────────────────────────────────────────────────────┐
│           Latency vs Throughput Trade-off Curve      │
│                                                       │
│  Throughput │         ╱                               │
│  (req/sec)  │        ╱                                │
│             │       ╱    ← Larger batch, more throughput│
│             │      ╱                                  │
│             │     ╱                                   │
│             │    ╱                                    │
│             │───╱──────────────────────               │
│             │  ╱    ← But latency increases too!       │
│             └─────────────────────── Latency(ms)      │
│                                                       │
│  Optimal: Dynamic Batching + Continuous Batching      │
└─────────────────────────────────────────────────────┘

Strategy	Latency	Throughput	Use Case
Single Request Serial	Lowest	Lowest	Real-time Chat
Static Batching	Medium	Medium	Offline Inference
Continuous Batching	Low	High	Online Serving
Speculative Decoding	Lower	High	Real-time + High Throughput

AI Model Serving: Four Frameworks Compared

vLLM vs Triton vs TGI vs Ollama

Dimension	vLLM	Triton Inference Server	TGI (Text Generation Inference)	Ollama
Developer	UC Berkeley	NVIDIA	HuggingFace	Community
Core Strength	PagedAttention	Multi-framework	Flash Attention	Simplicity
Continuous Batching	✅ Native	✅ Dynamic Batching	✅ Native	❌
Tensor Parallel	✅	✅	✅	❌
Streaming Output	✅ SSE	✅	✅ SSE	✅
OpenAI-Compatible API	✅	Needs Adapter	✅	✅
Quantization Support	AWQ/GPTQ/GGUF	INT8/FP8	AWQ/GPTQ/bitsandbytes	GGUF
GPU Utilization	90%+	80%+	85%+	60%+
K8s Friendliness	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
Production Ready	✅	✅	✅	⚠️ Dev/Test
Multi-Model Serving	Single per instance	Multiple per instance	Single per instance	Multiple per instance
Hot Model Loading	❌	✅	❌	✅

Architecture Comparison

┌─────────────────────────────────────────────────────────────────┐
│                      vLLM Architecture                           │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Request  │───→│ Scheduler    │───→│ PagedAttention  │        │
│  │ Queue    │    │ (Continuous  │    │ (KV Block Mgmt) │        │
│  │          │    │  Batching)   │    │                 │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    OpenAI API    Dynamic Batch         GPU HBM                  │
│    Compatible    Size Control          KV Cache                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Triton Architecture                           │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────────┐        │
│  │ Model    │───→│ Dynamic      │───→│ Backend         │        │
│  │ Repo     │    │ Batcher     │    │ (PyTorch/TF/    │        │
│  │          │    │              │    │  ONRT/TensorRT) │        │
│  └─────────┘    └──────────────┘    └─────────────────┘        │
│       ↑               ↑                     ↑                   │
│    Multi-model    Priority Queue    Multi-framework Engine      │
└─────────────────────────────────────────────────────────────────┘

vLLM Quick Start

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    quantization="awq",
)

params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["Explain the core challenges of cloud-native AI deployment"], params)
for output in outputs:
    print(output.outputs[0].text)

TGI Launch Example

docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --quantize awq \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --cuda-memory-fraction 0.9

Selection Guide

┌──────────────────────────────────────────────────────┐
│         Model Serving Framework Decision Tree         │
│                                                        │
│  Need maximum GPU utilization?                         │
│    ├─ Yes → vLLM (PagedAttention, 90%+ utilization)   │
│    └─ No ↓                                             │
│  Need multi-framework/multi-model coexistence?         │
│    ├─ Yes → Triton (multiple models per instance)      │
│    └─ No ↓                                             │
│  Need seamless HuggingFace ecosystem integration?      │
│    ├─ Yes → TGI (Flash Attention + HF Hub)             │
│    └─ No ↓                                             │
│  Just local dev/testing?                               │
│    └─ Yes → Ollama (one command to start)              │
└──────────────────────────────────────────────────────┘

Dockerizing AI Applications: Multi-Stage Build + Compose

Multi-Stage Dockerfile

# ===== Stage 1: Model Download =====
FROM python:3.11-slim AS model-downloader

RUN pip install --no-cache-dir huggingface-hub

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ARG QUANTIZATION=awq

RUN huggingface-cli download \
    ${MODEL_ID} \
    --local-dir /models/${MODEL_ID} \
    --exclude "*.safetensors" \
    && echo "Model config downloaded"

# ===== Stage 2: Inference Runtime =====
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3-pip python3.11-venv \
    && rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/vllm
ENV PATH="/opt/vllm/bin:${PATH}"

RUN pip install --no-cache-dir \
    vllm==0.8.0 \
    transformers>=4.45.0 \
    accelerate

COPY --from=model-downloader /models /models

ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ENV MODEL_ID=${MODEL_ID}

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/Qwen/Qwen2.5-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "8192"]

Docker Compose Local Dev Environment

version: "3.8"

services:
  vllm-server:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        MODEL_ID: Qwen/Qwen2.5-7B-Instruct
        QUANTIZATION: awq
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - model-cache:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  milvus-standalone:
    image: milvusdb/milvus:v2.4.17
    ports:
      - "19530:19530"
      - "9091:9091"
    environment:
      - ETCD_USE_EMBED=true
      - COMMON_STORAGETYPE=local
    volumes:
      - milvus-data:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 15s
      timeout: 5s
      retries: 3

  elasticsearch:
    image: elasticsearch:8.15.0
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health"]
      interval: 15s
      timeout: 5s
      retries: 3

  springboot-ai:
    build:
      context: ../springboot-ai-service
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - SPRING_AI_OPENAI_BASE-URL=http://vllm-server:8000/v1
      - SPRING_AI_OPENAI_API-KEY=not-needed
      - MILVUS_HOST=milvus-standalone
      - MILVUS_PORT=19530
      - ELASTICSEARCH_URIS=http://elasticsearch:9200
    depends_on:
      vllm-server:
        condition: service_healthy
      milvus-standalone:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy

volumes:
  model-cache:
  milvus-data:
  es-data:

Docker Image Optimization Tips

Technique	Effect	Example
Multi-stage build	Image size -60%+	Build tools excluded from runtime
Model quantization	Size -50% to -75%	FP16→INT4
.dockerignore	Build context reduction	Exclude .git, pycache
Layer cache reuse	Build time -80%	Unchanged dependency layers reused
Base image selection	10x size difference	slim vs full
External model volume	Image excludes model data	Mount at runtime

K8s GPU Scheduling: NVIDIA GPU Operator + MIG + Time Slicing

NVIDIA GPU Operator Installation

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |
    driver:
      enabled: true
      version: "550.54.15"
    toolkit:
      enabled: true
      version: "v1.15.0"
    devicePlugin:
      enabled: true
      config:
        name: mig-config
        shared:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4
    migManager:
      enabled: true
      config:
        name: mig-partition-config

MIG Partition Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-partition-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 14
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 2
            "1g.10gb": 3
        - devices: [1]
          mig-enabled: false

GPU Time Slicing Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-80gb.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
            devices: all
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
          args:
            - --config-file=/etc/nvidia-plugin/config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/nvidia-plugin
      volumes:
        - name: config
          configMap:
            name: time-slicing-config

Pod Using MIG Instance

apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-pod
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.8.0
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
      command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - Qwen/Qwen2.5-7B-Instruct-AWQ
        - --max-model-len
        - "4096"
        - --gpu-memory-utilization
        - "0.9"

GPU Scheduling Strategy Comparison

Strategy	Principle	Pros	Cons	Use Case
Exclusive GPU	1 Pod = 1 GPU	Zero interference, stable perf	Resource waste	Large model inference
MIG Partition	Hardware-level isolation	True isolation, configurable sizes	A100/H100 only	Multi small model serving
Time Slicing	Software-level multiplexing	Works on all GPUs	Context switch overhead	Low-priority tasks
MPS	Shared GPU context	Low-overhead sharing	No isolation, one crash affects all	Same-team multi-task

Model Inference Service Deployment: vLLM + HPA Autoscaling

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b
  namespace: ai-inference
  labels:
    app: vllm-qwen7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b
  template:
    metadata:
      labels:
        app: vllm-qwen7b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: vllm-qwen7b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
            requests:
              nvidia.com/gpu: 2
              cpu: "4"
              memory: 16Gi
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-7B-Instruct-AWQ"
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - Qwen/Qwen2.5-7B-Instruct-AWQ
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "2"
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "8192"
            - --enable-prefix-caching
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      nodeSelector:
        gpu-type: a100-80gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

HPA Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-qwen7b-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "10"
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "70"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Service + PodMonitor

apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen7b-svc
  namespace: ai-inference
spec:
  selector:
    app: vllm-qwen7b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-qwen7b-monitor
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vllm-qwen7b
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 15s

Key vLLM Monitoring Metrics

Metric	Description	Alert Threshold
`vllm:num_requests_running`	Active requests being processed	> 80% max_num_seqs
`vllm:num_requests_waiting`	Requests in waiting queue	> 10 for 5 minutes
`vllm:gpu_cache_usage_perc`	KV Cache utilization	> 90%
`vllm:avg_generation_throughput`	Average generation throughput	< 100 tok/s
`vllm:e2e_request_latency_seconds`	End-to-end request latency	P99 > 10s
`vllm:num_preemptions`	Preemption count	> 0

RAG Vector Service K8s Deployment: Milvus Cluster + Elasticsearch

Overall Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    RAG Vector Service Architecture                │
│                                                                    │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐        │
│  │  User     │───→│ Spring Boot  │───→│ vLLM Embedding   │        │
│  │  Query    │    │ AI Gateway   │    │ (Vector Gen)     │        │
│  └──────────┘    └──────┬───────┘    └──────────────────┘        │
│                         │                                        │
│              ┌──────────┼──────────┐                              │
│              ▼                     ▼                              │
│     ┌──────────────┐    ┌──────────────┐                         │
│     │   Milvus     │    │Elasticsearch │                         │
│     │  (Vector     │    │ (Full-text   │                         │
│     │   Search)    │    │  Search)     │                         │
│     │  HNSW/IVF   │    │  BM25+Dense  │                         │
│     └──────────────┘    └──────────────┘                         │
│            │                    │                                │
│            └────────┬───────────┘                                │
│                     ▼                                            │
│              ┌──────────────┐    ┌──────────────────┐           │
│              │  Hybrid      │───→│ vLLM LLM         │           │
│              │  Results     │    │ (Answer Gen)     │           │
│              │  Rerank      │    │                  │           │
│              └──────────────┘    └──────────────────┘           │
└──────────────────────────────────────────────────────────────────┘

Milvus Cluster Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: milvus
  namespace: ai-rag
spec:
  serviceName: milvus-headless
  replicas: 3
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.17
          ports:
            - containerPort: 19530
            - containerPort: 9091
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
          env:
            - name: ETCD_ENDPOINTS
              value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
            - name: MINIO_ADDRESS
              value: "minio:9000"
            - name: COMMON_STORAGETYPE
              value: "remote"
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 15
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: milvus-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-svc
  namespace: ai-rag
spec:
  selector:
    app: milvus
  ports:
    - name: grpc
      port: 19530
      targetPort: 19530
    - name: metrics
      port: 9091
      targetPort: 9091
  type: ClusterIP

Elasticsearch Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: ai-rag
spec:
  serviceName: elasticsearch-headless
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        - name: sysctl
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: elasticsearch:8.15.0
          ports:
            - containerPort: 9200
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: cluster.name
              value: "ai-rag-cluster"
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch-headless,elasticsearch-1.elasticsearch-headless,elasticsearch-2.elasticsearch-headless"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: xpack.security.enabled
              value: "false"
            - name: ES_JAVA_OPTS
              value: "-Xms4g -Xmx4g"
          volumeMounts:
            - name: es-data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: es-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi

Milvus Collection Creation

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://milvus-svc.ai-rag.svc:19530")

schema = client.create_schema(auto_id=True, enable_dynamic_field=True)

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="timestamp", datatype=DataType.INT64)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 32, "efConstruction": 256}
)
index_params.add_index(
    field_name="text",
    index_type="INVERTED"
)

client.create_collection(
    collection_name="knowledge_base",
    schema=schema,
    index_params=index_params,
)

Vector Search vs Full-text Search vs Hybrid Search

Search Type	Principle	Advantage	Disadvantage	Use Case
Vector Search	Semantic similarity	Understands semantics	Weak exact match	Semantic Q&A
Full-text Search	BM25 keywords	Exact matching	No semantic understanding	Keyword search
Hybrid Search	Vector + full-text fusion	Best of both worlds	Higher compute cost	Production RAG
Rerank	Cross-encoder reranking	Highest accuracy	Slower	High-precision scenarios

Spring Boot AI Microservices on K8s

Spring Boot AI Application Dockerfile

FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY gradle/ gradle/
COPY gradlew build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon
COPY src/ src/
RUN ./gradlew bootJar --no-daemon -x test

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar

ENV JAVA_OPTS="-XX:+UseZGC -XX:MaxRAMPercentage=75.0"
ENV SPRING_PROFILES_ACTIVE=k8s

EXPOSE 8080

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -qO- http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["sh", "-c", "java ${JAVA_OPTS} -jar app.jar"]

K8s Deployment + Service + Ingress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: springboot-ai-service
  namespace: ai-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: springboot-ai-service
  template:
    metadata:
      labels:
        app: springboot-ai-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
        - name: app
          image: registry.example.com/springboot-ai-service:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          env:
            - name: SPRING_AI_OPENAI_BASE-URL
              value: "http://vllm-qwen7b-svc.ai-inference.svc:8000/v1"
            - name: SPRING_AI_OPENAI_API-KEY
              valueFrom:
                secretKeyRef:
                  name: ai-api-keys
                  key: openai-key
            - name: MILVUS_HOST
              value: "milvus-svc.ai-rag.svc"
            - name: MILVUS_PORT
              value: "19530"
            - name: ELASTICSEARCH_URIS
              value: "http://elasticsearch-svc.ai-rag.svc:9200"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config
      volumes:
        - name: config
          configMap:
            name: springboot-ai-config
---
apiVersion: v1
kind: Service
metadata:
  name: springboot-ai-svc
  namespace: ai-app
spec:
  selector:
    app: springboot-ai-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: springboot-ai-ingress
  namespace: ai-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-AI-Service: springboot-ai";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ai-api.example.com
      secretName: ai-api-tls
  rules:
    - host: ai-api.example.com
      http:
        paths:
          - path: /api/ai
            pathType: Prefix
            backend:
              service:
                name: springboot-ai-svc
                port:
                  number: 80

Spring Boot AI Core Configuration

spring:
  ai:
    openai:
      base-url: http://vllm-qwen7b-svc.ai-inference.svc:8000/v1
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: Qwen/Qwen2.5-7B-Instruct-AWQ
          temperature: 0.7
          max-tokens: 2048
    vectorstore:
      milvus:
        client:
          host: ${MILVUS_HOST:milvus-svc.ai-rag.svc}
          port: ${MILVUS_PORT:19530}
        database-name: default
        collection-name: knowledge_base
        embedding-dimension: 1024
        metric-type: COSINE
        index-type: HNSW

  elasticsearch:
    uris: ${ELASTICSEARCH_URIS:http://elasticsearch-svc.ai-rag.svc:9200}

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: springboot-ai-service
  tracing:
    sampling:
      probability: 1.0
  otlp:
    metrics:
      export:
        url: http://otel-collector.observability.svc:4318/v1/metrics
    tracing:
      endpoint: http://otel-collector.observability.svc:4318/v1/traces

Observability: Prometheus + Grafana + OpenTelemetry

Monitoring Architecture

┌──────────────────────────────────────────────────────────────────┐
│               Full-Stack Observability Architecture               │
│                                                                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ vLLM     │  │ Milvus   │  │ES        │  │Spring    │        │
│  │ /metrics │  │ /metrics │  │/_prom    │  │/actuator │        │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│       │              │              │              │              │
│       └──────────────┴──────────────┴──────────────┘              │
│                            │                                      │
│                   ┌────────▼────────┐                             │
│                   │  OTel Collector │                             │
│                   │ (Receive+Process│                             │
│                   │  +Export)       │                             │
│                   └────────┬────────┘                             │
│              ┌─────────────┼─────────────┐                        │
│              ▼             ▼             ▼                        │
│     ┌──────────────┐ ┌──────────┐ ┌──────────┐                  │
│     │  Prometheus   │ │  Jaeger  │ │  Loki    │                  │
│     │  (Metrics)    │ │ (Traces) │ │ (Logs)   │                  │
│     └──────┬───────┘ └──────────┘ └──────────┘                  │
│            ▼                                                      │
│     ┌──────────────┐                                              │
│     │   Grafana    │                                              │
│     │  (Dashboard) │                                              │
│     └──────────────┘                                              │
└──────────────────────────────────────────────────────────────────┘

OpenTelemetry Collector Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          ports:
            - containerPort: 4318
            - containerPort: 4317
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
      prometheus:
        config:
          scrape_configs:
            - job_name: "vllm"
              scrape_interval: 15s
              static_configs:
                - targets: ["vllm-qwen7b-svc.ai-inference.svc:8000"]
            - job_name: "milvus"
              scrape_interval: 15s
              static_configs:
                - targets: ["milvus-svc.ai-rag.svc:9091"]

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400

    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
      otlphttp:
        endpoint: http://jaeger-collector.observability.svc:4318
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push

    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlphttp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]

Prometheus Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: observability
spec:
  groups:
    - name: vllm-alerts
      rules:
        - alert: VLLMHighQueueDepth
          expr: vllm_num_requests_waiting > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue depth too high"
            description: "Instance {{ $labels.instance }} has {{ $value }} waiting requests"

        - alert: VLLMHighLatency
          expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 10
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "vLLM P99 latency too high"
            description: "Instance {{ $labels.instance }} P99 latency {{ $value }}s"

        - alert: VLLMGPUCacheFull
          expr: vllm_gpu_cache_usage_perc > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM GPU cache utilization too high"
            description: "Instance {{ $labels.instance }} cache usage {{ $value }}%"

        - alert: VLLMPreemptions
          expr: increase(vllm_num_preemptions[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "vLLM request preemptions detected"
            description: "Instance {{ $labels.instance }} had {{ $value }} preemptions in 5m"

    - name: milvus-alerts
      rules:
        - alert: MilvusHighQueryLatency
          expr: histogram_quantile(0.99, rate(milvus_proxy_sq_latency_bucket[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Milvus query latency too high"

        - alert: MilvusInsertRateLow
          expr: rate(milvus_proxy_insert_vectors_count[5m]) < 10
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Milvus insert rate abnormally low"

Key Grafana Dashboard Panels

Panel	Metric	Visualization
Inference QPS	`rate(vllm_num_requests_total[1m])`	Line chart
P50/P90/P99 Latency	`histogram_quantile`	Line chart
GPU Cache Usage	`vllm_gpu_cache_usage_perc`	Gauge
Request Queue Depth	`vllm_num_requests_waiting`	Bar chart
Vector Search Latency	`milvus_proxy_sq_latency`	Line chart
Spring Boot JVM	`jvm_memory_used_bytes`	Area chart
HTTP Error Rate	`rate(http_server_requests_seconds_count{status=~"5.."}[5m])`	Line chart
Distributed Traces	Jaeger integration	Table + link

Cost Optimization: Quantization + MIG Reuse + Dynamic Scaling + Spot Instances

Model Quantization Comparison

Method	Accuracy Loss	Size Reduction	Inference Speedup	Compatibility
FP16 (baseline)	0%	1x	1x	All
INT8 (PTQ)	<1%	2x	1.5x	A100/H100
INT4 (GPTQ)	1-3%	4x	1.8x	Needs calibration data
INT4 (AWQ)	1-2%	4x	2x	Needs calibration data
INT4 (GGUF)	2-5%	4x	1.5x	CPU/GPU universal
FP8	<0.5%	2x	2x+	H100 native

Cost Optimization Strategy Overview

┌──────────────────────────────────────────────────────────────────┐
│              Four Pillars of Cost Optimization                    │
│                                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────┐ │
│  │ 1.Quantize   │  │ 2.MIG Reuse  │  │ 3.Dynamic    │  │4.Spot│ │
│  │              │  │              │  │    Scaling    │  │Inst. │ │
│  │ FP16→INT4   │  │ 1GPU→7MIG   │  │ HPA+VPA     │  │Bid   │ │
│  │ Size↓75%    │  │ Util↑7x     │  │ Idle scale   │  │Cost  │ │
│  │ Speed↑2x    │  │ Cost↓85%    │  │ Cost↓40%    │  │↓70%  │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────┘ │
│                                                                    │
│  Combined effect: Cost reduced to 5%-15% of baseline              │
└──────────────────────────────────────────────────────────────────┘

Cost Calculation Example

Plan	GPU Config	Monthly Cost	Notes
Baseline: FP16 Exclusive	4 × A100 80GB	$8,400	2 replicas × 2 GPU
INT4 Quantized	2 × A100 80GB	$4,200	2 replicas × 1 GPU
INT4 + MIG	1 × A100 80GB	$2,100	1 replica × 1 MIG instance
INT4 + MIG + HPA	1 × A100 80GB (idle 0.5)	$1,400	Scale to 1 replica when idle
INT4 + MIG + HPA + Spot	1 × A100 Spot	$420	Spot discount 70%

Spot Instance Strategy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen7b-spot
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen7b-spot
  template:
    metadata:
      labels:
        app: vllm-qwen7b-spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: cloud.google.com/gke-preemptible
                    operator: In
                    values: ["true"]
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          env:
            - name: GRACEFUL_SHUTDOWN
              value: "true"
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]

Quantized Model Loading Configuration

from vllm import LLM

llm_awq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    enforce_eager=True,
)

llm_gptq = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    quantization="gptq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

GitOps: ArgoCD + Kustomize Automated Delivery

GitOps Workflow

┌──────────────────────────────────────────────────────────────────┐
│                    GitOps Delivery Pipeline                       │
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ Developer│───→│ CI Build │───→│ Image    │───→│ Git Repo │  │
│  │ Commit   │    │ & Test   │    │ Push     │    │ Manifest │  │
│  └──────────┘    └──────────┘    └──────────┘    └────┬─────┘  │
│                                                       │          │
│                                              ┌────────▼───────┐  │
│                                              │   ArgoCD       │  │
│                                              │  Detect Change │  │
│                                              │  Auto Sync     │  │
│                                              └────────┬───────┘  │
│                                                       │          │
│                                    ┌──────────────────┼──────┐   │
│                                    ▼                  ▼      ▼   │
│                              ┌──────────┐   ┌──────────┐  ┌──┐ │
│                              │  Dev     │   │  Staging  │  │PR││
│                              │  Env     │   │  Env      │  │OD│ │
│                              └──────────┘   └──────────┘  └──┘ │
└──────────────────────────────────────────────────────────────────┘

Kustomize Directory Structure

k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── ingress.yaml
├── overlays/
│   ├── dev/
│   │   ├── kustomization.yaml
│   │   └── patch-replicas.yaml
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patch-resources.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── patch-replicas.yaml
│       ├── patch-resources.yaml
│       └── patch-hpa.yaml

Base Kustomization

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml

commonLabels:
  app.kubernetes.io/part-of: ai-inference-platform
  app.kubernetes.io/managed-by: kustomize

images:
  - name: vllm-server
    newName: registry.example.com/vllm-openai
    newTag: v0.8.0

configMapGenerator:
  - name: vllm-config
    literals:
      - MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-AWQ
      - MAX_MODEL_LEN=8192
      - GPU_MEMORY_UTILIZATION=0.9

secretGenerator:
  - name: ai-api-keys
    literals:
      - openai-key=placeholder

Production Overlay

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ai-inference-production

resources:
  - ../../base

patches:
  - target:
      kind: Deployment
      name: vllm-qwen7b
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-qwen7b
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: vllm
                resources:
                  limits:
                    nvidia.com/gpu: 2
                  requests:
                    nvidia.com/gpu: 2
                    cpu: "4"
                    memory: 16Gi

  - target:
      kind: HorizontalPodAutoscaler
      name: vllm-qwen7b-hpa
    patch: |
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: vllm-qwen7b-hpa
      spec:
        minReplicas: 4
        maxReplicas: 16

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-inference-platform
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: ai-platform-deploys
spec:
  project: ai-platform
  source:
    repoURL: https://git.example.com/ai-platform/k8s-manifests.git
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 5m

ArgoCD App of Apps Pattern

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ai-platform/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ai-platform
  namespace: argocd
spec:
  description: AI Inference Platform Project
  sourceRepos:
    - https://git.example.com/ai-platform/*
  destinations:
    - namespace: ai-inference-*
      server: https://kubernetes.default.svc
    - namespace: ai-rag-*
      server: https://kubernetes.default.svc
    - namespace: ai-app-*
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota

Summary: Cloud-Native AI Deployment Checklist

Phase	Checklist Item	Status
Model Serving	vLLM/Triton selection finalized	☐
Model Serving	Quantization strategy (INT4/AWQ) decided	☐
Docker	Multi-stage Dockerfile written	☐
Docker	Docker Compose local validation	☐
K8s GPU	GPU Operator installed	☐
K8s GPU	MIG/Time Slicing configured	☐
Inference	vLLM Deployment deployed	☐
Inference	HPA autoscaling configured	☐
RAG Service	Milvus cluster deployed	☐
RAG Service	Elasticsearch deployed	☐
Microservices	Spring Boot AI on K8s	☐
Microservices	Ingress routing configured	☐
Observability	Prometheus + Grafana	☐
Observability	OpenTelemetry distributed tracing	☐
Observability	Alert rules configured	☐
Cost Optimization	Model quantization validated	☐
Cost Optimization	Spot instance strategy	☐
GitOps	ArgoCD + Kustomize	☐
GitOps	Multi-environment Overlay config	☐

Cloud-native AI deployment is not the destination, but the starting point. From Docker containerization to K8s orchestration, from GPU scheduling to observability — each step is engineering accumulation. Remember: deployment without monitoring is flying blind; operations without automation is manual labor. Automate everything with GitOps, and make AI inference services true first-class citizens of cloud-native.