What Is VRAM for LLMs? Weights, KV Cache, Context, and Fit
Direct answer
VRAM is the high-bandwidth memory directly available to a GPU or accelerator. During LLM inference it can hold model weights, KV cache, activations, runtime workspaces, and other buffers, so a checkpoint fitting on disk does not prove that it can be served in the same amount of VRAM.
VRAM is not the same as storage
A model checkpoint on an SSD is persistent data. The model must be loaded into CPU memory, GPU memory, or a combination before it can run. Checkpoint size is therefore only one input to capacity planning.
The loaded representation may differ from the files on disk. A runtime can convert data types, unpack quantized formats, retain selected layers at higher precision, allocate temporary buffers, or map tensors across several devices. Compression used for downloading a file also does not imply equivalent runtime compression.
Measure the loaded process and peak allocation in the target runtime.
What uses VRAM during inference
An LLM inference process commonly needs memory for:
- Model weights
- KV cache for processed tokens
- Temporary activations
- Attention and matrix-multiplication workspaces
- Quantization scales and metadata
- Token and batching buffers
- CUDA or runtime context
- Memory allocator reservations and fragmentation
- Adapters or multiple model components
- Graph-capture or compilation buffers
The exact categories depend on architecture, runtime, and hardware. This is why two implementations of the same checkpoint can report different peak memory.
Weight memory is the starting point
For a dense model, a first approximation is:
raw weight bytes = parameter count × bits per stored parameter ÷ 8
A 7-billion-parameter model has an ideal raw payload of about 14 GB at 16 bits per parameter, 7 GB at 8 bits, or 3.5 GB at 4 bits. These decimal values exclude overhead.
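The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `raw_weight_bytes` is illustrative rather than part of any library, and the result is only the ideal payload, before quantization scales, metadata, or runtime overhead.

```python
def raw_weight_bytes(parameter_count: int, bits_per_parameter: float) -> float:
    """Ideal weight payload in bytes: parameters x bits per parameter / 8."""
    return parameter_count * bits_per_parameter / 8


params = 7_000_000_000  # the 7-billion-parameter example from the text

for bits in (16, 8, 4):
    gigabytes = raw_weight_bytes(params, bits) / 1e9  # decimal GB, as in the text
    print(f"{bits:>2} bits per parameter: {gigabytes:.1f} GB")
```

Treat the output as a lower bound for planning, not as a measured footprint.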
Mixture-of-experts models require extra care. Inference may activate only a subset of experts for each token, but all resident expert weights can still need storage unless the runtime offloads or distributes them. Active parameter count can help estimate compute, while total resident parameter count is usually relevant to weight memory.
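The gap between total and active parameters is easy to see with made-up numbers. The sketch below assumes a hypothetical model with 2 billion shared parameters and 8 experts of 5 billion parameters each, 2 of which are active per token. These figures do not describe any real checkpoint, and `moe_parameter_counts` is an illustrative helper.

```python
def moe_parameter_counts(shared_params, params_per_expert, num_experts, active_experts):
    """Return (total resident parameters, parameters active per token)."""
    total = shared_params + params_per_expert * num_experts
    active = shared_params + params_per_expert * active_experts
    return total, active


# Hypothetical configuration, chosen only to illustrate the arithmetic.
total, active = moe_parameter_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    active_experts=2,
)

bytes_per_param = 2  # 16 bits per stored parameter
print(f"resident weights:        {total * bytes_per_param / 1e9:.0f} GB")
print(f"weights active per token: {active * bytes_per_param / 1e9:.0f} GB")
```

Under these assumptions the weight budget follows the 42 billion resident parameters, not the 12 billion active ones, unless the runtime offloads or distributes experts.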
The quantization guide explains why a 4-bit label does not define the entire deployed footprint.
KV cache grows with tokens and concurrency
Autoregressive transformers reuse attention keys and values for tokens already processed. This KV cache prevents the model from recomputing the full sequence at every decoding step.
Its memory depends on architecture and configuration, including:
- Number of layers
- Number of key-value heads
- Head dimension
- Cache data type
- Tokens retained per sequence
- Concurrent sequences
KV bytes ≈ layers × KV heads × head dimension × 2 × bytes per value × cached tokens × concurrent sequences
The factor of two represents keys and values. Model architectures can use multi-head, grouped-query, or multi-query attention, so take the KV-head count from the model's exact configuration rather than assuming it equals the total attention-head count.
Long context and high concurrency can make KV cache larger than expected even when weights fit comfortably.
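The KV cache approximation translates directly into code. The sketch below uses a hypothetical configuration (32 layers, head dimension 128, a 16-bit cache, 4,096 cached tokens, one sequence) and compares 8 KV heads against 32. None of these values come from a specific model, and `kv_cache_bytes` is an illustrative name.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_value,
                   cached_tokens, concurrent_sequences):
    """Approximate KV cache size; ignores block rounding and runtime overhead."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    return (layers * kv_heads * head_dim * keys_and_values
            * bytes_per_value * cached_tokens * concurrent_sequences)


# Hypothetical settings shared by both comparisons.
shared = dict(layers=32, head_dim=128, bytes_per_value=2,
              cached_tokens=4096, concurrent_sequences=1)

grouped_query = kv_cache_bytes(kv_heads=8, **shared)   # fewer KV heads than attention heads
multi_head = kv_cache_bytes(kv_heads=32, **shared)     # one KV head per attention head

print(f"8 KV heads:  {grouped_query / 2**30:.1f} GiB")
print(f"32 KV heads: {multi_head / 2**30:.1f} GiB")
```

With everything else fixed, using the attention-head count where the KV-head count belongs would overstate this cache fourfold, and the estimate scales linearly with cached tokens and concurrent sequences.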
Fit is not the same as useful serving capacity
A model technically fits when the runtime can load it and complete a test request. A production service needs additional headroom for realistic prompts, outputs, concurrency, temporary peaks, and operational variance.
1. Can the weights load?
2. Can one representative request complete?
3. Can the target concurrency and context distribution meet latency goals without out-of-memory failures?
Only the third question describes deployable capacity.
The GPU VRAM fit tool is useful for initial filtering. Final selection should use a benchmark on the exact checkpoint and runtime.
Context length changes memory
A larger advertised context window does not allocate its full maximum in every implementation, but serving more active tokens generally increases KV-cache demand. Memory planning should use the expected input-plus-output sequence distribution and the maximum number of sequences active at once.
A workload with 32 concurrent short chats can use memory differently from one long document request. Batch schedulers can mix prefill and decoding requests, and paged KV-cache systems can reduce fragmentation or allocate cache in blocks. These optimizations improve utilization but do not make token state free.
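One way to see the difference is to model block-based allocation. The sketch below assumes a paged cache that allocates whole blocks of 16 tokens and a per-token cost of 131,072 bytes. Both values are assumptions for illustration, and real runtimes differ in block size and bookkeeping. The two workloads hold the same total number of tokens.

```python
import math


def paged_kv_bytes(tokens_per_sequence, block_tokens, bytes_per_token):
    """KV bytes when each sequence is rounded up to whole cache blocks."""
    blocks = sum(math.ceil(tokens / block_tokens) for tokens in tokens_per_sequence)
    return blocks * block_tokens * bytes_per_token


BLOCK_TOKENS = 16          # assumed block size
BYTES_PER_TOKEN = 131_072  # assumed KV bytes per cached token

short_chats = [1_001] * 32   # 32 concurrent chats, 32,032 tokens in total
long_document = [32_032]     # one request with the same token count

chats_bytes = paged_kv_bytes(short_chats, BLOCK_TOKENS, BYTES_PER_TOKEN)
document_bytes = paged_kv_bytes(long_document, BLOCK_TOKENS, BYTES_PER_TOKEN)

print(f"32 short chats:    {chats_bytes / 2**20:.0f} MiB")
print(f"one long document: {document_bytes / 2**20:.0f} MiB")
```

The gap here comes only from partially filled final blocks, one per sequence. Prefill peaks and scheduler behavior, which this sketch ignores, can separate the two workloads further.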
Read what an LLM context window is before assigning a context budget.
Activations and temporary peaks still matter
Inference does not retain the full training activation graph, so it usually needs less memory than training. It still creates intermediate tensors and workspaces. Prefill with a long sequence can produce a different peak from token-by-token decoding.
Optimized attention implementations, kernel selection, compilation, tensor parallelism, and batch size can change the peak. A process that appears safe after model loading can fail on its first large request.
Use representative warmup and stress requests when measuring memory.
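For a PyTorch-based runtime, peak memory around a request can be read from the CUDA caching allocator. The sketch below assumes PyTorch is the runtime. `run_request` is a hypothetical callable that issues one warmup or stress request, and the function returns `None` when PyTorch or a CUDA device is unavailable. These counters cover only memory managed by PyTorch's allocator, so the CUDA context and allocations made by other libraries are not included.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def measure_peak_memory(run_request, device=0):
    """Run a callable and report peak allocated and reserved bytes on one device."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to measure without a CUDA device

    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)  # start a fresh peak window
    run_request()
    torch.cuda.synchronize(device)              # wait for queued kernels to finish

    return {
        "peak_allocated_bytes": torch.cuda.max_memory_allocated(device),
        "peak_reserved_bytes": torch.cuda.max_memory_reserved(device),
    }


# Usage: replace the lambda with a call that sends a representative long prompt.
print(measure_peak_memory(lambda: None))
```

Comparing the result for a short request against a long prefill shows whether a process that looks safe after loading still has room for its largest expected request.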
Multi-GPU serving
A model can be distributed across devices using tensor parallelism, pipeline parallelism, expert parallelism, or layer placement. Aggregate memory may be sufficient while an individual device remains overloaded because tensors are not divided evenly.
Multi-GPU serving also introduces communication. More devices can make a model fit but do not guarantee lower latency or better cost. Interconnect bandwidth, topology, synchronization, and runtime support matter.
Record per-device peak memory and throughput. Do not divide total model bytes by GPU count and assume an even result.
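A small example shows why the average can mislead. The tensor sizes and the layer placement below are hypothetical, and `per_device_bytes` is an illustrative helper, not a runtime API. It sums weight bytes per device and ignores KV cache and workspaces.

```python
from collections import defaultdict


def per_device_bytes(tensor_bytes, placement):
    """Sum the bytes of each named tensor group onto its assigned device."""
    totals = defaultdict(int)
    for name, size in tensor_bytes.items():
        totals[placement[name]] += size
    return dict(totals)


GB = 10**9

# Hypothetical model: an embedding, five equal layers, and an output head.
tensor_bytes = {"embed": 3 * GB, "head": 3 * GB}
tensor_bytes.update({f"layer{i}": 4 * GB for i in range(5)})

# Hypothetical placement: five layers cannot split evenly across two devices.
placement = {"embed": "gpu0", "layer0": "gpu0", "layer1": "gpu0", "layer2": "gpu0",
             "layer3": "gpu1", "layer4": "gpu1", "head": "gpu1"}

usage = per_device_bytes(tensor_bytes, placement)
average = sum(tensor_bytes.values()) / len(usage)

for device, used in sorted(usage.items()):
    print(f"{device}: {used / GB:.0f} GB")
print(f"naive average: {average / GB:.0f} GB")
```

Under this placement the naive average is 13 GB per device while one device holds 15 GB, so a device with 14 GB free would fail even though the average suggests it fits.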
CPU offload and unified memory
Some runtimes place part of a model in system RAM or stream layers between CPU and GPU memory. This can make a model run on a smaller GPU, but transfers may reduce performance substantially.
Apple silicon and some accelerators use unified memory rather than a discrete VRAM pool. The capacity label is different, yet the same budgeting principle applies: model data, runtime state, the operating system, and other processes share finite memory and bandwidth.
“Runs” and “runs at an acceptable speed” remain separate tests.
Leave operational headroom
Allocating every available byte to theoretical model state makes a service fragile. Headroom is needed for allocator behavior, request variance, runtime upgrades, longer-than-expected output, and monitoring.
There is no universal safe percentage. Establish headroom by measuring peak memory under realistic and worst-credible workloads, then applying an explicit safety margin. Alert on both allocated and reserved memory where the runtime exposes them.
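Applying an explicit margin to a measured peak can be written as a simple check. The 20 GB peak, the 24 GB device, and both margins below are placeholder values chosen to show the calculation, not recommendations. The function names are illustrative.

```python
def required_capacity_bytes(measured_peak_bytes, safety_margin):
    """Measured peak plus an explicit fractional margin (0.15 means 15%)."""
    return measured_peak_bytes * (1 + safety_margin)


def fits_with_headroom(measured_peak_bytes, device_capacity_bytes, safety_margin):
    """True when the peak plus the chosen margin stays within device capacity."""
    return required_capacity_bytes(measured_peak_bytes, safety_margin) <= device_capacity_bytes


GB = 10**9
measured_peak = 20 * GB    # placeholder: peak from a worst-credible stress run
device_capacity = 24 * GB  # placeholder: usable memory on the target device

for margin in (0.15, 0.25):
    verdict = fits_with_headroom(measured_peak, device_capacity, margin)
    print(f"margin {margin:.0%}: fits = {verdict}")
```

The margin is a policy choice, so record it next to the workload that produced the measured peak.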
Frequently asked questions
Is checkpoint size equal to required VRAM?
No. Runtime memory also includes KV cache, activations, workspaces, metadata, allocator reservations, and potentially a different loaded data type.
Why does a model load but fail when generating?
Loading proves that resident tensors fit. The first request can allocate KV cache and temporary tensors that push peak memory beyond the available capacity.
Does quantization reduce all VRAM usage?
No. It can reduce weight memory and sometimes other configured tensors, but KV cache, activations, buffers, and unquantized modules must be calculated separately.
Can two GPUs with 24 GB each run every model requiring less than 48 GB?
Not automatically. The runtime must support a suitable distribution strategy, each device's allocation must fit, and communication overhead must remain acceptable.
Cite this page
What Is VRAM for LLMs? Weights, KV Cache, Context, and Fit. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/what-is-vram-for-llms/