llm-d v0.5.0

Released: February 3, 2026

Full Release Notes: View on GitHub

The llm-d ecosystem consists of multiple interconnected components that work together to provide distributed inference capabilities for large language models.

Components

| Component | Description | Repository | Version |
| --- | --- | --- | --- |
| Inference Scheduler | The scheduler that makes optimized routing decisions for inference requests to the llm-d inference framework. | llm-d/llm-d-inference-scheduler | v0.5.0 |
| Model Service | `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing the Kubernetes resources needed to serve base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet). | llm-d-incubation/llm-d-modelservice | llm-d-modelservice-v0.4.5 |
| Inference Simulator | A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM. | llm-d/llm-d-inference-sim | v0.7.1 |
| Infrastructure | A Helm chart for deploying the gateway and related infrastructure assets for llm-d. | llm-d-incubation/llm-d-infra | v1.3.6 |
| KV Cache | Libraries for tokenization, KV-event processing, and KV-cache indexing and offloading. | llm-d/llm-d-kv-cache | v0.5.0 |
| Benchmark Tools | An automated workflow for benchmarking LLM inference with the llm-d stack, including tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles. | llm-d/llm-d-benchmark | v0.3.0 |
| Workload Variant Autoscaler | Graduated from experimental to core component. Provides saturation-based autoscaling for llm-d deployments. | llm-d-incubation/workload-variant-autoscaler | v0.5.0 |
| Gateway API Inference Extension | A Helm chart that deploys an InferencePool, a corresponding EndpointPicker (EPP) deployment, and related assets. | kubernetes-sigs/gateway-api-inference-extension | v1.3.0 |
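As an illustration of how the Helm-based components above are typically installed, the sketch below deploys the `llm-d-modelservice` chart. This is a minimal sketch, not a documented install path: the chart repository URL, release name, and namespace are placeholders, and the chart version is inferred from the `llm-d-modelservice-v0.4.5` tag; consult the component's repository for the actual chart source and supported values.

```shell
# Hypothetical install sketch for the modelservice Helm chart.
# CHART_REPO is a placeholder, NOT a real URL for this project.
CHART_REPO=https://example.invalid/llm-d-charts

helm install my-modelservice llm-d-modelservice \
  --repo "$CHART_REPO" \
  --version v0.4.5 \
  --namespace llm-d --create-namespace
```

The same pattern applies to the Infrastructure and Gateway API Inference Extension charts, substituting the chart name and version from the table above.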

Container Images

Container images are published to the GitHub Container Registry.

`ghcr.io/llm-d/<image-name>:<version>`
| Image | Description | Version | Pull Command |
| --- | --- | --- | --- |
| llm-d-cuda | CUDA-based inference image for NVIDIA GPUs | v0.5.0 | `ghcr.io/llm-d/llm-d-cuda:v0.5.0` |
| llm-d-xpu | Intel XPU inference image | v0.5.0 | `ghcr.io/llm-d/llm-d-xpu:v0.5.0` |
| llm-d-cpu | CPU-only inference image (New in v0.5.0) | v0.5.0 | `ghcr.io/llm-d/llm-d-cpu:v0.5.0` |
| llm-d-inference-scheduler | Inference scheduler for optimized routing | v0.5.0 | `ghcr.io/llm-d/llm-d-inference-scheduler:v0.5.0` |
| llm-d-routing-sidecar | Routing sidecar for request redirection | v0.5.0 | `ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0` |
| llm-d-inference-sim | Lightweight vLLM simulator | v0.7.1 | `ghcr.io/llm-d/llm-d-inference-sim:v0.7.1` |
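To make the naming pattern concrete, the snippet below composes an image reference from the registry prefix, an image name, and a version taken from the table, then shows where a pull would go. The specific image and version are just one row from the table; swap in any other pair.

```shell
# Compose an image reference following the ghcr.io/llm-d/<image-name>:<version>
# pattern documented above.
REGISTRY=ghcr.io/llm-d
IMAGE=llm-d-cuda        # any image name from the table
VERSION=v0.5.0          # the matching Version column

REF="${REGISTRY}/${IMAGE}:${VERSION}"
echo "$REF"             # prints ghcr.io/llm-d/llm-d-cuda:v0.5.0

# docker pull "$REF"    # or: podman pull "$REF"
```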

Note: The llm-d-aws image is deprecated as of this release.

Getting Started

Each component has its own detailed documentation page accessible from the sidebar. For a comprehensive view of how these components work together, see the main Architecture Overview.

Previous Releases

For information about previous versions and their features, visit the GitHub Releases page.

Contributing

To contribute to any of these components, visit their respective repositories and follow their contribution guidelines. Each component maintains its own development workflow and contribution process.