In 2026, AI Not on the Cloud Means AI Not in Production
Model training happens in the lab, inference serving runs in the cloud — this is the iron law of AI engineering.
A brutal reality: your 7B model runs blazingly fast on your laptop, but collapses in production due to misconfigured GPU scheduling, container errors, and broken autoscaling. AI deployment is not just writing a Dockerfile.
Cloud-Native AI Deployment Landscape
┌─────────────────────────────────────────────────────────────────────┐
│ Cloud-Native AI Deployment Architecture │
├─────────────┬─────────────┬──────────────┬─────────────────────────┤
│ Model Serve │ Vector Svc │ Microservices│ Infrastructure │
│ │ │ │ │
│ vLLM/Triton │ Milvus │ Spring Boot │ K8s + GPU Operator │
│ TGI/Ollama │ Elastic- │ AI Gateway │ NVIDIA MIG + Time Slicing│
│ Model Router│ search │ Ingress │ Prometheus + Grafana │
│ HPA Scale │ Embedding │ Service Mesh │ ArgoCD + Kustomize │
├─────────────┴─────────────┴──────────────┴─────────────────────────┤
│ Docker Containerization + K8s Orchestration │
│ Multi-stage Build · GPU-Aware Scheduling · Elastic Scale │
│ · GitOps Delivery │
└─────────────────────────────────────────────────────────────────────┘
Three Challenges of AI Deployment
Challenge 1: GPU Scarcity — Compute is Hard Currency
┌──────────────────────────────────────────────────┐
│ GPU Resource Supply-Demand Gap │
│ │
│ Need: 100 inference services × 1 GPU = 100 GPU │
│ Reality: Cluster only has 8 × A100 = 8 GPU │
│ Gap: 92 GPU ❌ │
│ │
│ ┌──────┐ MIG Slice ┌──────────────┐ │
│ │ A100 │ ────────→ │ 7 × MIG │ │
│ │ 80GB │ Time Slice │ + Time Reuse │ │
│ └──────┘ ────────→ │ = 14+ vGPU │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────┘
| GPU Model |
VRAM |
MIG Slices |
Suitable Model Size |
Monthly Cost |
| A100 80GB |
80GB |
7 × 10GB |
7B-13B |
$2,100 |
| A100 40GB |
40GB |
4 × 10GB |
7B |
$1,400 |
| H100 80GB |
80GB |
7 × 10GB |
7B-70B |
$3,500 |
| L40S 48GB |
48GB |
No MIG |
7B-13B |
$850 |
| T4 16GB |
16GB |
No MIG |
1B-3B |
$280 |
Challenge 2: Model Size Explosion — Storage & Transfer Nightmare
| Model |
Parameters |
FP16 Size |
INT8 Size |
INT4 Size |
| Qwen2.5-7B |
7B |
14GB |
7GB |
3.5GB |
| Llama3.1-13B |
13B |
26GB |
13GB |
6.5GB |
| Qwen2.5-72B |
72B |
144GB |
72GB |
36GB |
| DeepSeek-V3-671B |
671B |
1.3TB |
671GB |
335GB |
Model pull time: Pulling a 72B FP16 model from Docker Registry takes 30+ minutes; with INT4 quantization, only 8 minutes.
Challenge 3: Latency vs Throughput — Can't Have Both
┌─────────────────────────────────────────────────────┐
│ Latency vs Throughput Trade-off Curve │
│ │
│ Throughput │ ╱ │
│ (req/sec) │ ╱ │
│ │ ╱ ← Larger batch, more throughput│
│ │ ╱ │
│ │ ╱ │
│ │ ╱ │
│ │───╱────────────────────── │
│ │ ╱ ← But latency increases too! │
│ └─────────────────────── Latency(ms) │
│ │
│ Optimal: Dynamic Batching + Continuous Batching │
└─────────────────────────────────────────────────────┘
| Strategy |
Latency |
Throughput |
Use Case |
| Single Request Serial |
Lowest |
Lowest |
Real-time Chat |
| Static Batching |
Medium |
Medium |
Offline Inference |
| Continuous Batching |
Low |
High |
Online Serving |
| Speculative Decoding |
Lower |
High |
Real-time + High Throughput |
AI Model Serving: Four Frameworks Compared
vLLM vs Triton vs TGI vs Ollama
| Dimension |
vLLM |
Triton Inference Server |
TGI (Text Generation Inference) |
Ollama |
| Developer |
UC Berkeley |
NVIDIA |
HuggingFace |
Community |
| Core Strength |
PagedAttention |
Multi-framework |
Flash Attention |
Simplicity |
| Continuous Batching |
✅ Native |
✅ Dynamic Batching |
✅ Native |
❌ |
| Tensor Parallel |
✅ |
✅ |
✅ |
❌ |
| Streaming Output |
✅ SSE |
✅ |
✅ SSE |
✅ |
| OpenAI-Compatible API |
✅ |
Needs Adapter |
✅ |
✅ |
| Quantization Support |
AWQ/GPTQ/GGUF |
INT8/FP8 |
AWQ/GPTQ/bitsandbytes |
GGUF |
| GPU Utilization |
90%+ |
80%+ |
85%+ |
60%+ |
| K8s Friendliness |
⭐⭐⭐⭐⭐ |
⭐⭐⭐⭐ |
⭐⭐⭐⭐ |
⭐⭐⭐ |
| Production Ready |
✅ |
✅ |
✅ |
⚠️ Dev/Test |
| Multi-Model Serving |
Single per instance |
Multiple per instance |
Single per instance |
Multiple per instance |
| Hot Model Loading |
❌ |
✅ |
❌ |
✅ |
Architecture Comparison
┌─────────────────────────────────────────────────────────────────┐
│ vLLM Architecture │
│ ┌─────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Request │───→│ Scheduler │───→│ PagedAttention │ │
│ │ Queue │ │ (Continuous │ │ (KV Block Mgmt) │ │
│ │ │ │ Batching) │ │ │ │
│ └─────────┘ └──────────────┘ └─────────────────┘ │
│ ↑ ↑ ↑ │
│ OpenAI API Dynamic Batch GPU HBM │
│ Compatible Size Control KV Cache │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Triton Architecture │
│ ┌─────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Model │───→│ Dynamic │───→│ Backend │ │
│ │ Repo │ │ Batcher │ │ (PyTorch/TF/ │ │
│ │ │ │ │ │ ONRT/TensorRT) │ │
│ └─────────┘ └──────────────┘ └─────────────────┘ │
│ ↑ ↑ ↑ │
│ Multi-model Priority Queue Multi-framework Engine │
└─────────────────────────────────────────────────────────────────┘
vLLM Quick Start
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=2,
gpu_memory_utilization=0.9,
max_model_len=8192,
quantization="awq",
)
params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=2048,
)
outputs = llm.generate(["Explain the core challenges of cloud-native AI deployment"], params)
for output in outputs:
print(output.outputs[0].text)
TGI Launch Example
docker run --gpus all -p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:2.3.0 \
--model-id Qwen/Qwen2.5-7B-Instruct \
--quantize awq \
--max-input-length 4096 \
--max-total-tokens 8192 \
--cuda-memory-fraction 0.9
Selection Guide
┌──────────────────────────────────────────────────────┐
│ Model Serving Framework Decision Tree │
│ │
│ Need maximum GPU utilization? │
│ ├─ Yes → vLLM (PagedAttention, 90%+ utilization) │
│ └─ No ↓ │
│ Need multi-framework/multi-model coexistence? │
│ ├─ Yes → Triton (multiple models per instance) │
│ └─ No ↓ │
│ Need seamless HuggingFace ecosystem integration? │
│ ├─ Yes → TGI (Flash Attention + HF Hub) │
│ └─ No ↓ │
│ Just local dev/testing? │
│ └─ Yes → Ollama (one command to start) │
└──────────────────────────────────────────────────────┘
Dockerizing AI Applications: Multi-Stage Build + Compose
Multi-Stage Dockerfile
# ===== Stage 1: Model Download =====
FROM python:3.11-slim AS model-downloader
RUN pip install --no-cache-dir huggingface-hub
ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ARG QUANTIZATION=awq
RUN huggingface-cli download \
${MODEL_ID} \
--local-dir /models/${MODEL_ID} \
--exclude "*.safetensors" \
&& echo "Model config downloaded"
# ===== Stage 2: Inference Runtime =====
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 python3-pip python3.11-venv \
&& rm -rf /var/lib/apt/lists/*
RUN python3.11 -m venv /opt/vllm
ENV PATH="/opt/vllm/bin:${PATH}"
RUN pip install --no-cache-dir \
vllm==0.8.0 \
transformers>=4.45.0 \
accelerate
COPY --from=model-downloader /models /models
ARG MODEL_ID=Qwen/Qwen2.5-7B-Instruct
ENV MODEL_ID=${MODEL_ID}
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/Qwen/Qwen2.5-7B-Instruct", \
"--host", "0.0.0.0", \
"--port", "8000", \
"--tensor-parallel-size", "2", \
"--gpu-memory-utilization", "0.9", \
"--max-model-len", "8192"]
Docker Compose Local Dev Environment
version: "3.8"
services:
vllm-server:
build:
context: .
dockerfile: Dockerfile
args:
MODEL_ID: Qwen/Qwen2.5-7B-Instruct
QUANTIZATION: awq
ports:
- "8000:8000"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
environment:
- VLLM_WORKER_MULTIPROC_METHOD=spawn
volumes:
- model-cache:/root/.cache/huggingface
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 5
restart: unless-stopped
milvus-standalone:
image: milvusdb/milvus:v2.4.17
ports:
- "19530:19530"
- "9091:9091"
environment:
- ETCD_USE_EMBED=true
- COMMON_STORAGETYPE=local
volumes:
- milvus-data:/var/lib/milvus
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
interval: 15s
timeout: 5s
retries: 3
elasticsearch:
image: elasticsearch:8.15.0
ports:
- "9200:9200"
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- ES_JAVA_OPTS=-Xms1g -Xmx1g
volumes:
- es-data:/usr/share/elasticsearch/data
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health"]
interval: 15s
timeout: 5s
retries: 3
springboot-ai:
build:
context: ../springboot-ai-service
dockerfile: Dockerfile
ports:
- "8080:8080"
environment:
- SPRING_AI_OPENAI_BASE-URL=http://vllm-server:8000/v1
- SPRING_AI_OPENAI_API-KEY=not-needed
- MILVUS_HOST=milvus-standalone
- MILVUS_PORT=19530
- ELASTICSEARCH_URIS=http://elasticsearch:9200
depends_on:
vllm-server:
condition: service_healthy
milvus-standalone:
condition: service_healthy
elasticsearch:
condition: service_healthy
volumes:
model-cache:
milvus-data:
es-data:
Docker Image Optimization Tips
| Technique |
Effect |
Example |
| Multi-stage build |
Image size -60%+ |
Build tools excluded from runtime |
| Model quantization |
Size -50% to -75% |
FP16→INT4 |
| .dockerignore |
Build context reduction |
Exclude .git, pycache |
| Layer cache reuse |
Build time -80% |
Unchanged dependency layers reused |
| Base image selection |
10x size difference |
slim vs full |
| External model volume |
Image excludes model data |
Mount at runtime |
K8s GPU Scheduling: NVIDIA GPU Operator + MIG + Time Slicing
NVIDIA GPU Operator Installation
apiVersion: v1
kind: Namespace
metadata:
name: gpu-operator
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: gpu-operator
namespace: gpu-operator
spec:
repo: https://helm.ngc.nvidia.com/nvidia
chart: gpu-operator
targetNamespace: gpu-operator
valuesContent: |
driver:
enabled: true
version: "550.54.15"
toolkit:
enabled: true
version: "v1.15.0"
devicePlugin:
enabled: true
config:
name: mig-config
shared:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
migManager:
enabled: true
config:
name: mig-partition-config
MIG Partition Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: mig-partition-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
all-balanced:
- devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 7
all-1g.5gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.5gb": 14
mixed:
- devices: [0]
mig-enabled: true
mig-devices:
"2g.20gb": 2
"1g.10gb": 3
- devices: [1]
mig-enabled: false
GPU Time Slicing Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
a100-80gb.yaml: |
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
devices: all
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: gpu-operator
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
template:
spec:
containers:
- name: nvidia-device-plugin
image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
args:
- --config-file=/etc/nvidia-plugin/config.yaml
volumeMounts:
- name: config
mountPath: /etc/nvidia-plugin
volumes:
- name: config
configMap:
name: time-slicing-config
Pod Using MIG Instance
apiVersion: v1
kind: Pod
metadata:
name: vllm-mig-pod
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.0
resources:
limits:
nvidia.com/mig-1g.10gb: 1
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
- --model
- Qwen/Qwen2.5-7B-Instruct-AWQ
- --max-model-len
- "4096"
- --gpu-memory-utilization
- "0.9"
GPU Scheduling Strategy Comparison
| Strategy |
Principle |
Pros |
Cons |
Use Case |
| Exclusive GPU |
1 Pod = 1 GPU |
Zero interference, stable perf |
Resource waste |
Large model inference |
| MIG Partition |
Hardware-level isolation |
True isolation, configurable sizes |
A100/H100 only |
Multi small model serving |
| Time Slicing |
Software-level multiplexing |
Works on all GPUs |
Context switch overhead |
Low-priority tasks |
| MPS |
Shared GPU context |
Low-overhead sharing |
No isolation, one crash affects all |
Same-team multi-task |
Model Inference Service Deployment: vLLM + HPA Autoscaling
vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen7b
namespace: ai-inference
labels:
app: vllm-qwen7b
spec:
replicas: 2
selector:
matchLabels:
app: vllm-qwen7b
template:
metadata:
labels:
app: vllm-qwen7b
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: vllm-qwen7b
topologyKey: kubernetes.io/hostname
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.0
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 2
requests:
nvidia.com/gpu: 2
cpu: "4"
memory: 16Gi
env:
- name: MODEL_NAME
value: "Qwen/Qwen2.5-7B-Instruct-AWQ"
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
args:
- --model
- Qwen/Qwen2.5-7B-Instruct-AWQ
- --host
- "0.0.0.0"
- --port
- "8000"
- --tensor-parallel-size
- "2"
- --gpu-memory-utilization
- "0.9"
- --max-model-len
- "8192"
- --enable-prefix-caching
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
nodeSelector:
gpu-type: a100-80gb
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
HPA Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-qwen7b-hpa
namespace: ai-inference
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-qwen7b
minReplicas: 2
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: "10"
- type: Pods
pods:
metric:
name: vllm_gpu_cache_usage_perc
target:
type: AverageValue
averageValue: "70"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 120
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 120
Service + PodMonitor
apiVersion: v1
kind: Service
metadata:
name: vllm-qwen7b-svc
namespace: ai-inference
spec:
selector:
app: vllm-qwen7b
ports:
- name: http
port: 8000
targetPort: 8000
type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: vllm-qwen7b-monitor
namespace: ai-inference
spec:
selector:
matchLabels:
app: vllm-qwen7b
podMetricsEndpoints:
- port: http
path: /metrics
interval: 15s
Key vLLM Monitoring Metrics
| Metric |
Description |
Alert Threshold |
vllm:num_requests_running |
Active requests being processed |
> 80% max_num_seqs |
vllm:num_requests_waiting |
Requests in waiting queue |
> 10 for 5 minutes |
vllm:gpu_cache_usage_perc |
KV Cache utilization |
> 90% |
vllm:avg_generation_throughput |
Average generation throughput |
< 100 tok/s |
vllm:e2e_request_latency_seconds |
End-to-end request latency |
P99 > 10s |
vllm:num_preemptions |
Preemption count |
> 0 |
RAG Vector Service K8s Deployment: Milvus Cluster + Elasticsearch
Overall Architecture
┌──────────────────────────────────────────────────────────────────┐
│ RAG Vector Service Architecture │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ User │───→│ Spring Boot │───→│ vLLM Embedding │ │
│ │ Query │ │ AI Gateway │ │ (Vector Gen) │ │
│ └──────────┘ └──────┬───────┘ └──────────────────┘ │
│ │ │
│ ┌──────────┼──────────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Milvus │ │Elasticsearch │ │
│ │ (Vector │ │ (Full-text │ │
│ │ Search) │ │ Search) │ │
│ │ HNSW/IVF │ │ BM25+Dense │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ Hybrid │───→│ vLLM LLM │ │
│ │ Results │ │ (Answer Gen) │ │
│ │ Rerank │ │ │ │
│ └──────────────┘ └──────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Milvus Cluster Deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: milvus
namespace: ai-rag
spec:
serviceName: milvus-headless
replicas: 3
selector:
matchLabels:
app: milvus
template:
metadata:
labels:
app: milvus
spec:
containers:
- name: milvus
image: milvusdb/milvus:v2.4.17
ports:
- containerPort: 19530
- containerPort: 9091
resources:
requests:
cpu: "4"
memory: 8Gi
limits:
cpu: "8"
memory: 16Gi
env:
- name: ETCD_ENDPOINTS
value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
- name: MINIO_ADDRESS
value: "minio:9000"
- name: COMMON_STORAGETYPE
value: "remote"
volumeMounts:
- name: milvus-data
mountPath: /var/lib/milvus
livenessProbe:
httpGet:
path: /healthz
port: 9091
initialDelaySeconds: 30
periodSeconds: 15
readinessProbe:
httpGet:
path: /healthz
port: 9091
initialDelaySeconds: 15
periodSeconds: 10
volumeClaimTemplates:
- metadata:
name: milvus-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
name: milvus-svc
namespace: ai-rag
spec:
selector:
app: milvus
ports:
- name: grpc
port: 19530
targetPort: 19530
- name: metrics
port: 9091
targetPort: 9091
type: ClusterIP
Elasticsearch Deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
namespace: ai-rag
spec:
serviceName: elasticsearch-headless
replicas: 3
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
initContainers:
- name: sysctl
image: busybox
command: ["sysctl", "-w", "vm.max_map_count=262144"]
securityContext:
privileged: true
containers:
- name: elasticsearch
image: elasticsearch:8.15.0
ports:
- containerPort: 9200
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
env:
- name: cluster.name
value: "ai-rag-cluster"
- name: discovery.seed_hosts
value: "elasticsearch-0.elasticsearch-headless,elasticsearch-1.elasticsearch-headless,elasticsearch-2.elasticsearch-headless"
- name: cluster.initial_master_nodes
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name: xpack.security.enabled
value: "false"
- name: ES_JAVA_OPTS
value: "-Xms4g -Xmx4g"
volumeMounts:
- name: es-data
mountPath: /usr/share/elasticsearch/data
volumeClaimTemplates:
- metadata:
name: es-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 200Gi
Milvus Collection Creation
from pymilvus import MilvusClient, DataType
client = MilvusClient(uri="http://milvus-svc.ai-rag.svc:19530")
schema = client.create_schema(auto_id=True, enable_dynamic_field=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="timestamp", datatype=DataType.INT64)
index_params = client.prepare_index_params()
index_params.add_index(
field_name="embedding",
index_type="HNSW",
metric_type="COSINE",
params={"M": 32, "efConstruction": 256}
)
index_params.add_index(
field_name="text",
index_type="INVERTED"
)
client.create_collection(
collection_name="knowledge_base",
schema=schema,
index_params=index_params,
)
Vector Search vs Full-text Search vs Hybrid Search
| Search Type |
Principle |
Advantage |
Disadvantage |
Use Case |
| Vector Search |
Semantic similarity |
Understands semantics |
Weak exact match |
Semantic Q&A |
| Full-text Search |
BM25 keywords |
Exact matching |
No semantic understanding |
Keyword search |
| Hybrid Search |
Vector + full-text fusion |
Best of both worlds |
Higher compute cost |
Production RAG |
| Rerank |
Cross-encoder reranking |
Highest accuracy |
Slower |
High-precision scenarios |
Spring Boot AI Microservices on K8s
Spring Boot AI Application Dockerfile
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY gradle/ gradle/
COPY gradlew build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon
COPY src/ src/
RUN ./gradlew bootJar --no-daemon -x test
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar
ENV JAVA_OPTS="-XX:+UseZGC -XX:MaxRAMPercentage=75.0"
ENV SPRING_PROFILES_ACTIVE=k8s
EXPOSE 8080
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD wget -qO- http://localhost:8080/actuator/health || exit 1
ENTRYPOINT ["sh", "-c", "java ${JAVA_OPTS} -jar app.jar"]
K8s Deployment + Service + Ingress
apiVersion: apps/v1
kind: Deployment
metadata:
name: springboot-ai-service
namespace: ai-app
spec:
replicas: 3
selector:
matchLabels:
app: springboot-ai-service
template:
metadata:
labels:
app: springboot-ai-service
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"
spec:
containers:
- name: app
image: registry.example.com/springboot-ai-service:1.0.0
ports:
- containerPort: 8080
resources:
requests:
cpu: "1"
memory: 2Gi
limits:
cpu: "2"
memory: 4Gi
env:
- name: SPRING_AI_OPENAI_BASE-URL
value: "http://vllm-qwen7b-svc.ai-inference.svc:8000/v1"
- name: SPRING_AI_OPENAI_API-KEY
valueFrom:
secretKeyRef:
name: ai-api-keys
key: openai-key
- name: MILVUS_HOST
value: "milvus-svc.ai-rag.svc"
- name: MILVUS_PORT
value: "19530"
- name: ELASTICSEARCH_URIS
value: "http://elasticsearch-svc.ai-rag.svc:9200"
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 15
periodSeconds: 5
volumeMounts:
- name: config
mountPath: /app/config
volumes:
- name: config
configMap:
name: springboot-ai-config
---
apiVersion: v1
kind: Service
metadata:
name: springboot-ai-svc
namespace: ai-app
spec:
selector:
app: springboot-ai-service
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: springboot-ai-ingress
namespace: ai-app
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/configuration-snippet: |
more_set_headers "X-AI-Service: springboot-ai";
spec:
ingressClassName: nginx
tls:
- hosts:
- ai-api.example.com
secretName: ai-api-tls
rules:
- host: ai-api.example.com
http:
paths:
- path: /api/ai
pathType: Prefix
backend:
service:
name: springboot-ai-svc
port:
number: 80
Spring Boot AI Core Configuration
spring:
ai:
openai:
base-url: http://vllm-qwen7b-svc.ai-inference.svc:8000/v1
api-key: ${OPENAI_API_KEY}
chat:
options:
model: Qwen/Qwen2.5-7B-Instruct-AWQ
temperature: 0.7
max-tokens: 2048
vectorstore:
milvus:
client:
host: ${MILVUS_HOST:milvus-svc.ai-rag.svc}
port: ${MILVUS_PORT:19530}
database-name: default
collection-name: knowledge_base
embedding-dimension: 1024
metric-type: COSINE
index-type: HNSW
elasticsearch:
uris: ${ELASTICSEARCH_URIS:http://elasticsearch-svc.ai-rag.svc:9200}
management:
endpoints:
web:
exposure:
include: health,info,prometheus,metrics
metrics:
tags:
application: springboot-ai-service
tracing:
sampling:
probability: 1.0
otlp:
metrics:
export:
url: http://otel-collector.observability.svc:4318/v1/metrics
tracing:
endpoint: http://otel-collector.observability.svc:4318/v1/traces
Observability: Prometheus + Grafana + OpenTelemetry
Monitoring Architecture
┌──────────────────────────────────────────────────────────────────┐
│ Full-Stack Observability Architecture │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ vLLM │ │ Milvus │ │ES │ │Spring │ │
│ │ /metrics │ │ /metrics │ │/_prom │ │/actuator │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ OTel Collector │ │
│ │ (Receive+Process│ │
│ │ +Export) │ │
│ └────────┬────────┘ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Prometheus │ │ Jaeger │ │ Loki │ │
│ │ (Metrics) │ │ (Traces) │ │ (Logs) │ │
│ └──────┬───────┘ └──────────┘ └──────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Grafana │ │
│ │ (Dashboard) │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
OpenTelemetry Collector Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: observability
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:0.110.0
ports:
- containerPort: 4318
- containerPort: 4317
resources:
requests:
cpu: "500m"
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
volumeMounts:
- name: config
mountPath: /etc/otelcol-contrib
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: observability
data:
config.yaml: |
receivers:
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
grpc:
endpoint: 0.0.0.0:4317
prometheus:
config:
scrape_configs:
- job_name: "vllm"
scrape_interval: 15s
static_configs:
- targets: ["vllm-qwen7b-svc.ai-inference.svc:8000"]
- job_name: "milvus"
scrape_interval: 15s
static_configs:
- targets: ["milvus-svc.ai-rag.svc:9091"]
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 5s
limit_mib: 400
exporters:
prometheusremotewrite:
endpoint: http://prometheus.observability.svc:9090/api/v1/write
otlphttp:
endpoint: http://jaeger-collector.observability.svc:4318
loki:
endpoint: http://loki.observability.svc:3100/loki/api/v1/push
service:
pipelines:
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlphttp]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [loki]
Prometheus Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ai-inference-alerts
namespace: observability
spec:
groups:
- name: vllm-alerts
rules:
- alert: VLLMHighQueueDepth
expr: vllm_num_requests_waiting > 20
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM request queue depth too high"
description: "Instance {{ $labels.instance }} has {{ $value }} waiting requests"
- alert: VLLMHighLatency
expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 10
for: 3m
labels:
severity: critical
annotations:
summary: "vLLM P99 latency too high"
description: "Instance {{ $labels.instance }} P99 latency {{ $value }}s"
- alert: VLLMGPUCacheFull
expr: vllm_gpu_cache_usage_perc > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM GPU cache utilization too high"
description: "Instance {{ $labels.instance }} cache usage {{ $value }}%"
- alert: VLLMPreemptions
expr: increase(vllm_num_preemptions[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "vLLM request preemptions detected"
description: "Instance {{ $labels.instance }} had {{ $value }} preemptions in 5m"
- name: milvus-alerts
rules:
- alert: MilvusHighQueryLatency
expr: histogram_quantile(0.99, rate(milvus_proxy_sq_latency_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Milvus query latency too high"
- alert: MilvusInsertRateLow
expr: rate(milvus_proxy_insert_vectors_count[5m]) < 10
for: 10m
labels:
severity: info
annotations:
summary: "Milvus insert rate abnormally low"
Key Grafana Dashboard Panels
| Panel |
Metric |
Visualization |
| Inference QPS |
rate(vllm_num_requests_total[1m]) |
Line chart |
| P50/P90/P99 Latency |
histogram_quantile |
Line chart |
| GPU Cache Usage |
vllm_gpu_cache_usage_perc |
Gauge |
| Request Queue Depth |
vllm_num_requests_waiting |
Bar chart |
| Vector Search Latency |
milvus_proxy_sq_latency |
Line chart |
| Spring Boot JVM |
jvm_memory_used_bytes |
Area chart |
| HTTP Error Rate |
rate(http_server_requests_seconds_count{status=~"5.."}[5m]) |
Line chart |
| Distributed Traces |
Jaeger integration |
Table + link |
Cost Optimization: Quantization + MIG Reuse + Dynamic Scaling + Spot Instances
Model Quantization Comparison
| Method |
Accuracy Loss |
Size Reduction |
Inference Speedup |
Compatibility |
| FP16 (baseline) |
0% |
1x |
1x |
All |
| INT8 (PTQ) |
<1% |
2x |
1.5x |
A100/H100 |
| INT4 (GPTQ) |
1-3% |
4x |
1.8x |
Needs calibration data |
| INT4 (AWQ) |
1-2% |
4x |
2x |
Needs calibration data |
| INT4 (GGUF) |
2-5% |
4x |
1.5x |
CPU/GPU universal |
| FP8 |
<0.5% |
2x |
2x+ |
H100 native |
Cost Optimization Strategy Overview
┌──────────────────────────────────────────────────────────────────┐
│ Four Pillars of Cost Optimization │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────┐ │
│ │ 1.Quantize │ │ 2.MIG Reuse │ │ 3.Dynamic │ │4.Spot│ │
│ │ │ │ │ │ Scaling │ │Inst. │ │
│ │ FP16→INT4 │ │ 1GPU→7MIG │ │ HPA+VPA │ │Bid │ │
│ │ Size↓75% │ │ Util↑7x │ │ Idle scale │ │Cost │ │
│ │ Speed↑2x │ │ Cost↓85% │ │ Cost↓40% │ │↓70% │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────┘ │
│ │
│ Combined effect: Cost reduced to 5%-15% of baseline │
└──────────────────────────────────────────────────────────────────┘
Cost Calculation Example
| Plan |
GPU Config |
Monthly Cost |
Notes |
| Baseline: FP16 Exclusive |
4 × A100 80GB |
$8,400 |
2 replicas × 2 GPU |
| INT4 Quantized |
2 × A100 80GB |
$4,200 |
2 replicas × 1 GPU |
| INT4 + MIG |
1 × A100 80GB |
$2,100 |
1 replica × 1 MIG instance |
| INT4 + MIG + HPA |
1 × A100 80GB (idle 0.5) |
$1,400 |
Scale to 1 replica when idle |
| INT4 + MIG + HPA + Spot |
1 × A100 Spot |
$420 |
Spot discount 70% |
Spot Instance Strategy
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen7b-spot
namespace: ai-inference
spec:
replicas: 2
selector:
matchLabels:
app: vllm-qwen7b-spot
template:
metadata:
labels:
app: vllm-qwen7b-spot
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: cloud.google.com/gke-preemptible
operator: In
values: ["true"]
tolerations:
- key: cloud.google.com/gke-preemptible
operator: Equal
value: "true"
effect: NoSchedule
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.0
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
env:
- name: GRACEFUL_SHUTDOWN
value: "true"
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 30"]
Quantized Model Loading Configuration
from vllm import LLM
llm_awq = LLM(
model="Qwen/Qwen2.5-7B-Instruct-AWQ",
quantization="awq",
gpu_memory_utilization=0.85,
max_model_len=4096,
enforce_eager=True,
)
llm_gptq = LLM(
model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
quantization="gptq",
gpu_memory_utilization=0.85,
max_model_len=4096,
)
GitOps: ArgoCD + Kustomize Automated Delivery
GitOps Workflow
┌──────────────────────────────────────────────────────────────────┐
│ GitOps Delivery Pipeline │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Developer│───→│ CI Build │───→│ Image │───→│ Git Repo │ │
│ │ Commit │ │ & Test │ │ Push │ │ Manifest │ │
│ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │
│ │ │
│ ┌────────▼───────┐ │
│ │ ArgoCD │ │
│ │ Detect Change │ │
│ │ Auto Sync │ │
│ └────────┬───────┘ │
│ │ │
│ ┌──────────────────┼──────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──┐ │
│ │ Dev │ │ Staging │ │PR││
│ │ Env │ │ Env │ │OD│ │
│ └──────────┘ └──────────┘ └──┘ │
└──────────────────────────────────────────────────────────────────┘
Kustomize Directory Structure
k8s/
├── base/
│ ├── kustomization.yaml
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── hpa.yaml
│ └── ingress.yaml
├── overlays/
│ ├── dev/
│ │ ├── kustomization.yaml
│ │ └── patch-replicas.yaml
│ ├── staging/
│ │ ├── kustomization.yaml
│ │ └── patch-resources.yaml
│ └── production/
│ ├── kustomization.yaml
│ ├── patch-replicas.yaml
│ ├── patch-resources.yaml
│ └── patch-hpa.yaml
Base Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- hpa.yaml
- ingress.yaml
commonLabels:
app.kubernetes.io/part-of: ai-inference-platform
app.kubernetes.io/managed-by: kustomize
images:
- name: vllm-server
newName: registry.example.com/vllm-openai
newTag: v0.8.0
configMapGenerator:
- name: vllm-config
literals:
- MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-AWQ
- MAX_MODEL_LEN=8192
- GPU_MEMORY_UTILIZATION=0.9
secretGenerator:
- name: ai-api-keys
literals:
- openai-key=placeholder
Production Overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ai-inference-production
resources:
- ../../base
patches:
- target:
kind: Deployment
name: vllm-qwen7b
patch: |
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen7b
spec:
replicas: 4
template:
spec:
containers:
- name: vllm
resources:
limits:
nvidia.com/gpu: 2
requests:
nvidia.com/gpu: 2
cpu: "4"
memory: 16Gi
- target:
kind: HorizontalPodAutoscaler
name: vllm-qwen7b-hpa
patch: |
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-qwen7b-hpa
spec:
minReplicas: 4
maxReplicas: 16
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: ai-inference-platform
namespace: argocd
annotations:
notifications.argoproj.io/subscribe.on-deployed.slack: ai-platform-deploys
spec:
project: ai-platform
source:
repoURL: https://git.example.com/ai-platform/k8s-manifests.git
targetRevision: main
path: k8s/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: ai-inference-production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
retry:
limit: 3
backoff:
duration: 30s
factor: 2
maxDuration: 5m
ArgoCD App of Apps Pattern
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: ai-platform-apps
namespace: argocd
spec:
project: default
source:
repoURL: https://git.example.com/ai-platform/argocd-apps.git
targetRevision: main
path: apps
destination:
server: https://kubernetes.default.svc
namespace: argocd
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: ai-platform
namespace: argocd
spec:
description: AI Inference Platform Project
sourceRepos:
- https://git.example.com/ai-platform/*
destinations:
- namespace: ai-inference-*
server: https://kubernetes.default.svc
- namespace: ai-rag-*
server: https://kubernetes.default.svc
- namespace: ai-app-*
server: https://kubernetes.default.svc
clusterResourceWhitelist:
- group: ""
kind: Namespace
namespaceResourceBlacklist:
- group: ""
kind: ResourceQuota
Summary: Cloud-Native AI Deployment Checklist
| Phase |
Checklist Item |
Status |
| Model Serving |
vLLM/Triton selection finalized |
☐ |
| Model Serving |
Quantization strategy (INT4/AWQ) decided |
☐ |
| Docker |
Multi-stage Dockerfile written |
☐ |
| Docker |
Docker Compose local validation |
☐ |
| K8s GPU |
GPU Operator installed |
☐ |
| K8s GPU |
MIG/Time Slicing configured |
☐ |
| Inference |
vLLM Deployment deployed |
☐ |
| Inference |
HPA autoscaling configured |
☐ |
| RAG Service |
Milvus cluster deployed |
☐ |
| RAG Service |
Elasticsearch deployed |
☐ |
| Microservices |
Spring Boot AI on K8s |
☐ |
| Microservices |
Ingress routing configured |
☐ |
| Observability |
Prometheus + Grafana |
☐ |
| Observability |
OpenTelemetry distributed tracing |
☐ |
| Observability |
Alert rules configured |
☐ |
| Cost Optimization |
Model quantization validated |
☐ |
| Cost Optimization |
Spot instance strategy |
☐ |
| GitOps |
ArgoCD + Kustomize |
☐ |
| GitOps |
Multi-environment Overlay config |
☐ |
Cloud-native AI deployment is not the destination, but the starting point. From Docker containerization to K8s orchestration, from GPU scheduling to observability — each step is engineering accumulation. Remember: deployment without monitoring is flying blind; operations without automation is manual labor. Automate everything with GitOps, and make AI inference services true first-class citizens of cloud-native.