Resources

Inference Glossary

A working reference for production inference — routers, gateways, caches, observability, evaluation, fine-tuning, the engines that serve every request.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

SGLang

An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

LLM Observability

A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.

LLM Evaluation

The practice of grading model outputs against a target — using a stack of synthetic data, deterministic heuristics, calibrated LLM judges, and shadow evaluation against production traffic.

Shadow Evaluation

Running a candidate model in parallel with the production model on real traffic, serving the production response to users, and grading the candidate's response asynchronously. The cleanest way to evaluate a model swap.

Fine-Tuning

Adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset. Usually done with parameter-efficient methods like LoRA that update less than 1% of weights.

Model Quantization

Reducing the numerical precision of a model's weights and activations — from 32-bit to 16, 8, or 4 bits — to shrink memory footprint and speed up memory-bandwidth-bound inference.

Inference Latency

The end-to-end time from request arrival to response delivery. For LLMs, decomposed into time-to-first-token (TTFT) and inter-token latency (ITL).

Model Weights

The learned numerical parameters of a neural network, stored as large multi-dimensional arrays. The artifact that defines what a trained model does.

Batch Inference

Processing multiple inference requests in the same forward pass to maximize GPU throughput and hardware utilization.

OpenAI-Compatible API

An HTTP API that accepts the same request shape and returns the same response shape as the OpenAI Chat Completions endpoint — letting any OpenAI SDK client point at a different base URL.

Tensor Cores

Specialized hardware units in NVIDIA GPUs that perform matrix multiply-and-accumulate operations in a single clock cycle, accelerating deep learning by an order of magnitude over standard CUDA cores.

CUDA

NVIDIA's parallel computing platform and programming model. The runtime, libraries, and API that let software use NVIDIA GPUs for general-purpose computation.

Model Serving

The infrastructure that turns a trained model into a live, scalable endpoint — handling request routing, batching, health checks, versioning, and metrics.