Inference Glossary

Tensor Cores

Specialized hardware units in NVIDIA GPUs that perform a small matrix multiply-and-accumulate operation in a single clock cycle, accelerating the matrix math in deep learning by roughly an order of magnitude over standard CUDA cores.

Tensor Cores are specialized processing units found in NVIDIA GPUs starting with the Volta architecture. They perform mixed-precision matrix multiply-and-accumulate operations as a single hardware primitive: a first-generation (Volta) Tensor Core multiplies two 4x4 matrices and adds a third, D = A x B + C, in a single clock cycle, and later generations increase per-core throughput. This architectural choice maps directly onto the linear algebra at the heart of deep learning, where most of the compute is spent in matrix multiplications.
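To make the primitive concrete, here is a minimal pure-Python sketch of the 4x4 multiply-and-accumulate described above. It assumes FP16 inputs and an FP32 accumulator, which is one common mixed-precision pairing. The function names (`to_fp16`, `to_fp32`, `mma_4x4`) are hypothetical, and the loop order and rounding points are illustrative rather than a model of what the hardware does internally.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows.

    Entries of A and B are rounded to FP16 (the low-precision inputs);
    products are accumulated on top of C in FP32 (the wider accumulator).
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


# Usage: multiplying by the identity leaves B unchanged, then C is added.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]
d = mma_4x4(identity, b, c)
```

A Tensor Core performs the equivalent of all three nested loops as one operation; a scalar core would have to issue each multiply-add separately.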

A standard CUDA core executes one scalar floating-point operation, typically a fused multiply-add, per clock cycle; a Tensor Core executes an entire small matrix operation in the same time. The throughput difference is dramatic. On an NVIDIA H100 (SXM), Tensor Cores deliver close to a petaflop of dense FP16 or BF16 compute, compared to roughly 67 teraflops of FP32 from the standard cores alone.
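The per-unit gap follows from simple counting. The sketch below assumes the 4x4x4 tile described for first-generation Tensor Cores and one fused multiply-add per clock for a scalar core; `fmas_per_mma` is a hypothetical helper name.

```python
def fmas_per_mma(m: int, n: int, k: int) -> int:
    """Fused multiply-adds in one (m x k) by (k x n) product with accumulate.

    Each of the m * n output elements needs k multiply-adds.
    """
    return m * n * k


scalar_fmas_per_clock = 1                      # one scalar core, one FMA
tile_fmas_per_clock = fmas_per_mma(4, 4, 4)    # one 4x4x4 tile per clock

print(tile_fmas_per_clock // scalar_fmas_per_clock)  # 64
```

This is a per-unit ratio only. A GPU has far fewer Tensor Cores than CUDA cores, and clock rates, tile sizes, and data types vary by generation, so the chip-level speedup is smaller than 64x.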

Tensor Cores support a growing set of precision formats depending on the GPU generation: FP16 since Volta, INT8 since Turing, BF16 and TF32 since Ampere, FP8 since Hopper, and FP4 on Blackwell. Lower-precision formats enable higher throughput at the cost of numerical precision, which is often acceptable for inference workloads where the model has been quantized appropriately. Frameworks like PyTorch and JAX, and inference engines like vLLM, SGLang, TensorRT-LLM, and Ion, automatically dispatch matrix operations to Tensor Cores when shapes and data types are compatible.
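The precision trade-off can be seen without a GPU. The following sketch rounds values to FP16 and BF16 using only the Python standard library. The helper names are hypothetical, and the BF16 routine is a simplified emulation (keep the top 16 bits of the FP32 encoding, round to nearest even) that assumes ordinary finite inputs and does not handle NaN or infinity.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16 (8 exponent bits, 7 mantissa bits).

    Simplified emulation: take the FP32 bit pattern, round to nearest even
    at bit 16, and zero the low 16 bits. Assumes a finite, in-range input.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fp16_or_none(x: float):
    """Return x rounded to FP16, or None if x is outside the FP16 range."""
    try:
        return to_fp16(x)
    except (OverflowError, struct.error):
        return None


# FP16 keeps more mantissa bits, so it is closer for small values...
fp16_err = abs(to_fp16(0.1) - 0.1)
bf16_err = abs(to_bf16(0.1) - 0.1)

# ...while BF16 keeps the FP32 exponent range, so large values still fit.
big_fp16 = fp16_or_none(70000.0)   # None: above the FP16 maximum of 65504
big_bf16 = to_bf16(70000.0)        # representable, but coarsely rounded
```

Which format is acceptable depends on the value ranges a model produces, which is part of why quantization has to be done carefully rather than applied blindly.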

Getting the most out of Tensor Cores requires aligning matrix dimensions to the Tensor Core tile sizes, typically multiples of 8 or 16 depending on precision. Dimensions that are not aligned may be padded or may fall back to slower code paths. Inference engines that have done this alignment work, Ion among them, extract substantially more throughput from the same hardware than naive implementations.
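One simple way to meet the alignment requirement is to pad each dimension up to the next multiple of the tile size. The sketch below assumes a multiple of 8 by default; the helper names and the example dimension of 50257 are hypothetical, and the right multiple for a given GPU and data type should be taken from the vendor documentation.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_gemm_shape(m: int, k: int, n: int, multiple: int = 8):
    """Pad the three dimensions of an (m x k) by (k x n) product."""
    return tuple(round_up(dim, multiple) for dim in (m, k, n))


# Usage: an awkward output dimension is padded; aligned ones are unchanged.
print(round_up(50257, 8))                      # 50264
print(padded_gemm_shape(32, 4096, 50257, 16))  # (32, 4096, 50272)
```

The padded rows or columns are filled with zeros and ignored in the result, trading a small amount of wasted work for a shape the hardware handles efficiently.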