- Joined
- Sep 13, 2023
- Messages
- 18
KServe has emerged as the de facto standard for enterprise LLM deployment on Kubernetes, handling everything from GPU scheduling to model versioning. After deploying production systems serving 50M+ daily requests, here's the complete architecture breakdown.
Core KServe Architecture for LLMs[/HEADING=2]
Core KServe Architecture for LLMs[/HEADING=2]
KServe operates as a three-layer system:
- Control Plane: Manages InferenceService CRDs and model lifecycle
- Data Plane: Routes traffic using Knative + Istio with intelligent load balancing
- Serving Runtime: vLLM, Triton, or TensorRT-LLM for actual inference
The key differentiator is KV-cache aware scheduling - KServe routes requests to pods that already have relevant context cached, reducing cold-start latency by up to 85%.
GPU Auto-Scaling Configuration[/HEADING=2]
Standard HPA fails for LLM workloads due to GPU memory constraints. Use queue-depth scaling with KEDA:
Code:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: llama-scaler
spec:
scaleTargetRef:
name: llama-70b-service
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_queue_depth
threshold: '5'
query: avg(vllm_queue_depth{service="llama-70b"})
Critical: Set resource limits correctly. For Llama-70B on A100-80GB:
Code:
resources:
limits:
nvidia.com/gpu: 2
memory: "160Gi"
requests:
nvidia.com/gpu: 2
memory: "120Gi"
Model Registry Integration[/HEADING=2]
KServe integrates with MLflow, Seldon, or custom S3-based registries. Use storageUri with version pinning:
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-model
spec:
predictor:
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.2.3"
resources:
limits:
nvidia.com/gpu: 2
Pro tip: Use model warming with initContainers to pre-download weights during pod startup, reducing first-request latency from 3+ minutes to under 30 seconds.
Canary Deployments for Model Updates[/HEADING=2]
KServe's traffic splitting enables zero-downtime model updates:
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-canary
spec:
predictor:
canaryTrafficPercent: 10
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.3.0"
Monitor canary metrics for 24-48 hours before full rollout. Key metrics: p99 latency, error rate, and throughput degradation.
Production Monitoring Stack[/HEADING=2]
vLLM exposes 15+ Prometheus metrics. Essential ones for LLM serving:
- vllm_queue_depth: Queue backlog (scale trigger)
- vllm_gpu_memory_usage: Memory utilization per GPU
- vllm_time_to_first_token: Prefill latency
- vllm_inter_token_latency: Decode performance
Set up alerts for queue depth > 20 and GPU memory > 90%. Use continuous batching to maximize throughput - properly configured vLLM can handle 100+ concurrent requests on a single A100.
Real-World Performance Numbers[/HEADING=2]
In production environments:
- Llama-70B on 2x A100-80GB: 45-60 tokens/second throughput
- Cold start time: 25-35 seconds with model warming
- GPU utilization: 85%+ with proper batching
- Cost reduction: 40-60% vs traditional serving due to efficient scaling
The disaggregated architecture (separate prefill/decode services) can improve GPU utilization to 95%+ for high-volume workloads, but adds complexity.
What's your experience with GPU memory fragmentation during auto-scaling events? Have you found specific batch sizes or sequence lengths that work best for your model sizes?
Standard HPA fails for LLM workloads due to GPU memory constraints. Use queue-depth scaling with KEDA:
Code:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: llama-scaler
spec:
scaleTargetRef:
name: llama-70b-service
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_queue_depth
threshold: '5'
query: avg(vllm_queue_depth{service="llama-70b"})
Critical: Set resource limits correctly. For Llama-70B on A100-80GB:
Code:
resources:
limits:
nvidia.com/gpu: 2
memory: "160Gi"
requests:
nvidia.com/gpu: 2
memory: "120Gi"
Model Registry Integration[/HEADING=2]
KServe integrates with MLflow, Seldon, or custom S3-based registries. Use storageUri with version pinning:
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-model
spec:
predictor:
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.2.3"
resources:
limits:
nvidia.com/gpu: 2
Pro tip: Use model warming with initContainers to pre-download weights during pod startup, reducing first-request latency from 3+ minutes to under 30 seconds.
Canary Deployments for Model Updates[/HEADING=2]
KServe's traffic splitting enables zero-downtime model updates:
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-canary
spec:
predictor:
canaryTrafficPercent: 10
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.3.0"
Monitor canary metrics for 24-48 hours before full rollout. Key metrics: p99 latency, error rate, and throughput degradation.
Production Monitoring Stack[/HEADING=2]
vLLM exposes 15+ Prometheus metrics. Essential ones for LLM serving:
- vllm_queue_depth: Queue backlog (scale trigger)
- vllm_gpu_memory_usage: Memory utilization per GPU
- vllm_time_to_first_token: Prefill latency
- vllm_inter_token_latency: Decode performance
Set up alerts for queue depth > 20 and GPU memory > 90%. Use continuous batching to maximize throughput - properly configured vLLM can handle 100+ concurrent requests on a single A100.
Real-World Performance Numbers[/HEADING=2]
In production environments:
- Llama-70B on 2x A100-80GB: 45-60 tokens/second throughput
- Cold start time: 25-35 seconds with model warming
- GPU utilization: 85%+ with proper batching
- Cost reduction: 40-60% vs traditional serving due to efficient scaling
The disaggregated architecture (separate prefill/decode services) can improve GPU utilization to 95%+ for high-volume workloads, but adds complexity.
What's your experience with GPU memory fragmentation during auto-scaling events? Have you found specific batch sizes or sequence lengths that work best for your model sizes?
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-model
spec:
predictor:
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.2.3"
resources:
limits:
nvidia.com/gpu: 2
KServe's traffic splitting enables zero-downtime model updates:
Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-canary
spec:
predictor:
canaryTrafficPercent: 10
model:
modelFormat:
name: vllm
storageUri: "s3://models/llama-70b/v1.3.0"
Monitor canary metrics for 24-48 hours before full rollout. Key metrics: p99 latency, error rate, and throughput degradation.
Production Monitoring Stack[/HEADING=2]
vLLM exposes 15+ Prometheus metrics. Essential ones for LLM serving:
- vllm_queue_depth: Queue backlog (scale trigger)
- vllm_gpu_memory_usage: Memory utilization per GPU
- vllm_time_to_first_token: Prefill latency
- vllm_inter_token_latency: Decode performance
Set up alerts for queue depth > 20 and GPU memory > 90%. Use continuous batching to maximize throughput - properly configured vLLM can handle 100+ concurrent requests on a single A100.
Real-World Performance Numbers[/HEADING=2]
In production environments:
- Llama-70B on 2x A100-80GB: 45-60 tokens/second throughput
- Cold start time: 25-35 seconds with model warming
- GPU utilization: 85%+ with proper batching
- Cost reduction: 40-60% vs traditional serving due to efficient scaling
The disaggregated architecture (separate prefill/decode services) can improve GPU utilization to 95%+ for high-volume workloads, but adds complexity.
What's your experience with GPU memory fragmentation during auto-scaling events? Have you found specific batch sizes or sequence lengths that work best for your model sizes?
In production environments:
- Llama-70B on 2x A100-80GB: 45-60 tokens/second throughput
- Cold start time: 25-35 seconds with model warming
- GPU utilization: 85%+ with proper batching
- Cost reduction: 40-60% vs traditional serving due to efficient scaling
The disaggregated architecture (separate prefill/decode services) can improve GPU utilization to 95%+ for high-volume workloads, but adds complexity.
What's your experience with GPU memory fragmentation during auto-scaling events? Have you found specific batch sizes or sequence lengths that work best for your model sizes?