Back to Inference Glossary
Inference Glossary

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

vLLM is an open-source library for fast and efficient large language model inference and serving. Originally developed at UC Berkeley, it has become one of the most widely adopted LLM serving engines because of its strong throughput, broad model coverage, and OpenAI-compatible API. vLLM supports LLaMA, Mistral, Qwen, GPT-NeoX, Falcon, and most other major architectures.

The key innovation in vLLM is **PagedAttention**, an attention algorithm inspired by virtual memory paging in operating systems. Traditional LLM serving allocates contiguous blocks of GPU memory for each request's KV cache, leading to fragmentation and wasted memory. PagedAttention stores the cache in non-contiguous pages that are dynamically allocated, which eliminates fragmentation and enables near-optimal memory utilization. The result is the ability to serve two to four times more concurrent requests on the same GPU.

Beyond PagedAttention, vLLM ships **continuous batching**, which dynamically adds new requests to running batches rather than waiting for all requests in a batch to finish. This avoids head-of-line blocking — the problem where one long-running request holds up everyone else — and keeps GPU utilization high regardless of how response lengths vary.

vLLM is the right baseline to measure other serving engines against. Cumulus' Ion engine claims a 30 to 50% throughput advantage over vLLM on Grace Hopper for the workloads we run; SGLang takes a different approach with RadixAttention. All three are reasonable choices, and the right one depends on hardware, model, and traffic shape.