Inference Latency
The end-to-end time from request arrival to response delivery. For LLMs, decomposed into time-to-first-token (TTFT) and inter-token latency (ITL).
Inference latency measures the total time required to process a single inference request, from the moment it arrives at the serving endpoint to the moment the response is returned. For user-facing applications it directly determines how responsive the system feels. It is typically tracked at multiple percentiles — P50 for the typical experience, P95 and P99 for the tail — because a small fraction of slow requests can dominate user perception.
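As a rough illustration of why the tail is tracked separately, the sketch below computes P50, P95, and P99 from a list of per-request latencies using only Python's standard library. The function name `latency_percentiles` and the sample values are hypothetical, chosen only to show a slow tail.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return P50, P95, and P99 of per-request latencies (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points P1..P99 (index 0 is P1).
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical sample: 98 fast requests and 2 slow ones.
samples = [100.0] * 98 + [2000.0] * 2
stats = latency_percentiles(samples)
mean_ms = statistics.mean(samples)
```

In this made-up sample P50 and P95 are both 100 ms while P99 is 2000 ms, and the mean of 138 ms reveals neither the typical experience nor the tail.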
For autoregressive language models, latency has two distinct components. **Time-to-first-token (TTFT)** is the time from request arrival to the first output token; it is dominated by prompt processing (prefill) and is what the user perceives as initial responsiveness. **Inter-token latency (ITL)** is the time between consecutive generated tokens; it sets the generation speed, and multiplied by output length it gives the time spent generating. A good streaming experience needs both low TTFT and low ITL; a fast first token followed by slow generation, or the reverse, still feels sluggish.
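One way to measure the two components on the client side is to record when the request was sent and when each streamed token arrived. The sketch below assumes such timestamps are already available; `ttft_and_itl` and its arguments are hypothetical names, not part of any particular serving API.

```python
def ttft_and_itl(request_start, token_times):
    """Compute TTFT and mean ITL (seconds) from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were generated")
    # TTFT: request start to the first output token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Hypothetical stream: first token at 0.5 s, then one token every 0.25 s.
ttft, mean_itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.25])
```

For this example stream the function reports a TTFT of 0.5 s and a mean ITL of 0.25 s. In practice ITL is usually tracked at percentiles as well, since a single stalled token is noticeable in a stream.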
The two components scale with different things. TTFT grows with prompt length, because prefill compute grows with the number of input tokens. ITL grows with model size and shrinks with memory bandwidth, because each decode step has to read the model weights from memory. Batch size also affects per-request latency: larger batches raise each request's response time in exchange for higher overall throughput, so the right batch size is a per-workload tuning question.
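A back-of-the-envelope model can make these scaling relationships concrete. The sketch below rests on a simplifying assumption: at small batch sizes, decoding is limited by memory bandwidth, and each step reads every weight once. It ignores KV-cache reads, compute time, and batching effects, so it gives a floor rather than a prediction. All function names and hardware numbers are hypothetical.

```python
def decode_itl_floor_s(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on ITL if each decode step reads every weight exactly once."""
    return param_count * bytes_per_param / mem_bandwidth_bytes_per_s

def total_latency_s(ttft_s, itl_s, output_tokens):
    """End-to-end latency: the first token, then one ITL per remaining token."""
    return ttft_s + max(output_tokens - 1, 0) * itl_s

# Hypothetical setup: 8B parameters, 1 TB/s of memory bandwidth.
itl_16bit = decode_itl_floor_s(8e9, 2, 1e12)  # 16-bit weights
itl_8bit = decode_itl_floor_s(8e9, 1, 1e12)   # 8-bit weights
latency = total_latency_s(ttft_s=0.2, itl_s=itl_16bit, output_tokens=101)
```

Under these assumptions the floor is about 16 ms per token with 16-bit weights and about 8 ms with 8-bit weights, so a 101-token response takes roughly 0.2 s + 100 × 16 ms = 1.8 s. Halving the bytes moved per token halves the floor, which is the effect quantization targets.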
Beyond raw hardware, the main techniques that reduce latency are quantization (less data to move per token), KV caching (no recomputation of keys and values for prior tokens), prompt caching (no prefill for shared prefixes), continuous batching (no head-of-line blocking), and custom attention kernels such as IonAttention that overlap prefill and decode. The Cumulus Router additionally reduces tail latency by speculatively dispatching a request whose primary is slow to a faster secondary and returning whichever response finishes first.
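The speculative-dispatch idea is commonly known as a hedged request. The sketch below is a generic `asyncio` illustration of that pattern, not the Cumulus Router's actual implementation; `hedged_call`, the `hedge_after_s` threshold, and the two simulated backends are all hypothetical.

```python
import asyncio

async def hedged_call(primary, secondary, hedge_after_s):
    """Start primary(); if it is still running after hedge_after_s seconds,
    also start secondary() and return whichever result arrives first."""
    primary_task = asyncio.create_task(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=hedge_after_s)
    if done:
        return primary_task.result()  # primary was fast enough, no hedge
    secondary_task = asyncio.create_task(secondary())
    done, pending = await asyncio.wait(
        {primary_task, secondary_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not waste capacity
    winner = primary_task if primary_task in done else secondary_task
    return winner.result()

async def _demo():
    async def slow_primary():      # simulated slow backend
        await asyncio.sleep(0.5)
        return "primary"
    async def fast_secondary():    # simulated fast backend
        await asyncio.sleep(0.01)
        return "secondary"
    return await hedged_call(slow_primary, fast_secondary, hedge_after_s=0.05)

result = asyncio.run(_demo())
```

Here the simulated primary exceeds the 50 ms threshold, so the secondary is started and its response is returned. The hedge threshold trades extra load for a shorter tail; setting it near an upper percentile of normal latency is one plausible choice, so that only the slowest requests are duplicated.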
Related Terms
KV Cache
A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.
vLLM
An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.
Prompt Cache
A response cache keyed on the request: exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts the input tokens that must be processed when real traffic repeats requests or shares prefixes.
Ion
Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.
Inference
Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.