KServe + K8s for LLM Serving: Production GPU Auto-Scaling Setup

Professor (Administrator, verified member)
KServe has become a common choice for enterprise LLM deployment on Kubernetes, handling everything from GPU scheduling to model versioning. This breakdown of the architecture comes from deploying production systems that serve 50M+ requests a day.

Core KServe Architecture for LLMs

KServe operates as a three-layer system:
  • Control Plane: Manages InferenceService CRDs and model lifecycle
  • Data Plane: Routes traffic using Knative + Istio with intelligent load balancing
  • Serving Runtime: vLLM, Triton, or TensorRT-LLM for actual inference

The key differentiator is KV-cache-aware scheduling: requests are routed to pods that already have the relevant context cached, so the shared prefix does not have to be prefilled again. This reduces cold-start latency by up to 85%.

GPU Auto-Scaling Configuration

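Standard CPU- or memory-based HPA is a poor fit for LLM workloads: the serving process reserves most of the GPU memory up front, so those utilization numbers barely move with load. Scale on queue depth with KEDA instead: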

Code:
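apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
spec:
  scaleTargetRef:
    name: llama-70b-service   # kind defaults to Deployment
  minReplicaCount: 1          # keep one replica warm
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth
      threshold: '5'          # target backlog per replica
      # KEDA's default metric type is AverageValue, so the HPA divides the
      # query result by the replica count. Return the total backlog (sum),
      # not avg, or the scaler will under-provision.
      query: sum(vllm_queue_depth{service="llama-70b"})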
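Since GPU replicas are slow and expensive to bring back, it can be worth damping scale-down while leaving scale-up immediate. The fragment below is a sketch of the extra fields you could merge into the ScaledObject spec above; the triggers stay as they are. All numbers here are illustrative assumptions, not values from the deployment described in this post.

```yaml
# Fragment to merge into the ScaledObject spec above (triggers unchanged).
# Every number below is an assumed starting point; tune for your workload.
spec:
  pollingInterval: 15                       # seconds between Prometheus queries (KEDA default is 30)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0     # react to a growing backlog right away
        scaleDown:
          stabilizationWindowSeconds: 600   # require 10 minutes of low backlog first
          policies:
          - type: Pods
            value: 1                        # then remove at most one replica...
            periodSeconds: 300              # ...every 5 minutes
```

The asymmetry is deliberate: a spurious scale-up costs a few minutes of GPU time, while a spurious scale-down costs a full model reload when traffic returns.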

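Critical: set resource limits correctly. GPU requests must equal GPU limits, since Kubernetes does not allow overcommitting extended resources. For Llama-70B on 2x A100-80GB: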
Code:
resources:
  limits:
    nvidia.com/gpu: 2
    memory: "160Gi"
  requests:
    nvidia.com/gpu: 2
    memory: "120Gi"
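Requesting two GPUs only reserves them; the serving runtime still has to split the model across both. As rough arithmetic, assuming 16-bit weights (the precision is an assumption, not something stated above): 70B parameters x 2 bytes is about 140 GB, more than a single 80 GB card holds, so the weights have to be sharded and whatever remains goes to KV cache. A sketch of the container args that would go with the resources block, assuming your runtime forwards them to vLLM unchanged:

```yaml
# Sketch: container-level fields that sit alongside the resources block.
# Assumes the serving runtime passes args straight through to vLLM.
args:
- --tensor-parallel-size=2        # shard the weights across both requested GPUs
- --gpu-memory-utilization=0.90   # share of each GPU vLLM may claim (0.9 is vLLM's default)
resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2
```

Keep the tensor-parallel size equal to the GPU count in the resources block; if the two drift apart, the pod either wastes a GPU or fails to start.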

Model Registry Integration

KServe integrates with MLflow, Seldon, or custom S3-based registries. Pin an exact version in storageUri so every rollout refers to a known artifact:

Code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-model
spec:
  predictor:
    model:
      modelFormat:
        name: vllm   # must match a format supported by an installed ServingRuntime
      storageUri: "s3://models/llama-70b/v1.2.3"   # versioned path, never "latest"
      resources:
        limits:
          nvidia.com/gpu: 2

Pro tip: pre-download weights in an initContainer so they are on local disk before the server starts (KServe's storage initializer does this for whatever storageUri points at). This reduces first-request latency from 3+ minutes to under 30 seconds.

Canary Deployments for Model Updates

KServe's traffic splitting enables zero-downtime model updates:

Code:
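apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-model   # update the existing service; a new name has no previous revision to split against
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% to the new revision, 90% stays on the previous one
    model:
      modelFormat:
        name: vllm
      storageUri: "s3://models/llama-70b/v1.3.0"
      resources:               # repeat the GPU limit so the new revision keeps it
        limits:
          nvidia.com/gpu: 2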

Monitor canary metrics for 24-48 hours before full rollout. Key metrics: p99 latency, error rate, and throughput degradation.
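Once the observation window is over, the canary ends in one of two states. The fragments below sketch both as changes to the predictor spec of the same InferenceService; metadata and the other model fields are omitted for brevity and stay as they were.

```yaml
# Promote: drop canaryTrafficPercent and re-apply.
# All traffic then goes to the latest revision.
spec:
  predictor:
    model:
      storageUri: "s3://models/llama-70b/v1.3.0"
---
# Roll back: pin the canary at 0%.
# All traffic returns to the previous revision.
spec:
  predictor:
    canaryTrafficPercent: 0
    model:
      storageUri: "s3://models/llama-70b/v1.3.0"
```

Rolling back with 0% rather than reverting storageUri keeps the failed revision around for inspection without serving from it.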

Production Monitoring Stack

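vLLM exposes 15+ Prometheus metrics. The names below are the ones used in this post; stock vLLM publishes its metrics with a vllm: prefix (for example vllm:num_requests_waiting and vllm:time_to_first_token_seconds), so adjust the names to whatever your exporter or recording rules emit. Essential ones for LLM serving: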
  • vllm_queue_depth: Queue backlog (scale trigger)
  • vllm_gpu_memory_usage: Memory utilization per GPU
  • vllm_time_to_first_token: Prefill latency
  • vllm_inter_token_latency: Decode performance

Set up alerts for queue depth > 20 and GPU memory > 90%. Use continuous batching (vLLM's default scheduling mode) to maximize throughput: a properly configured vLLM instance can handle 100+ concurrent requests on a single A100.
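The two alert thresholds could be expressed as Prometheus Operator rules along these lines. This is a sketch under several assumptions: the Prometheus Operator CRDs are installed, the metric names and the service label match the ones used earlier in this post, vllm_gpu_memory_usage is reported as a 0-1 ratio, and the rule names, "for" durations, and severities are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llama-serving-alerts            # hypothetical name
spec:
  groups:
  - name: vllm-serving
    rules:
    - alert: VLLMQueueBacklog
      # Fires when any matching series reports more than 20 queued requests.
      expr: max(vllm_queue_depth{service="llama-70b"}) > 20
      for: 5m                           # assumed: ignore short bursts
      labels:
        severity: warning
      annotations:
        summary: "vLLM queue depth above 20 for 5 minutes"
    - alert: VLLMGPUMemoryHigh
      # Assumes the metric is a ratio; use > 90 if yours is a percentage.
      expr: max(vllm_gpu_memory_usage{service="llama-70b"}) > 0.90
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU memory above 90% on at least one GPU"
```

The queue alert threshold (20) sits well above the scaling threshold (5), so it should only fire when auto-scaling has hit maxReplicaCount or is not keeping up.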

Real-World Performance Numbers

In production environments:
  • Llama-70B on 2x A100-80GB: 45-60 tokens/second throughput
  • Cold start time: 25-35 seconds with model warming
  • GPU utilization: 85%+ with proper batching
  • Cost reduction: 40-60% vs traditional serving due to efficient scaling

The disaggregated architecture (separate prefill/decode services) can improve GPU utilization to 95%+ for high-volume workloads, but adds complexity.

What's your experience with GPU memory fragmentation during auto-scaling events? Have you found specific batch sizes or sequence lengths that work best for your model sizes?

 