How to Calculate LLM Memory and VRAM Requirements for Inference
Direct answer
To estimate LLM inference memory, add loaded weight memory, KV-cache memory for the planned tokens and concurrent sequences, peak activations and workspaces, runtime overhead, and a measured safety margin. Parameter count alone estimates only the weight component. The final capacity decision should be verified by measuring per-device peak memory under representative long-context and concurrency workloads.
Use the related calculator: GPU VRAM fit calculator for open LLMs.
Start with the deployment, not the model name
Memory requirements depend on the exact checkpoint and serving configuration. Record:
- Total resident parameter count
- Architecture type, including dense or mixture of experts
- Stored weight precision or quantization method
- Compute data type
- Number of layers
- Hidden size
- Number of attention heads
- Number of key-value heads
- KV-cache data type
- Input and output sequence distribution
- Maximum concurrent sequences
- Runtime, kernels, and parallelism strategy
- Adapters or auxiliary models loaded with the base model
A family name such as “7B” or “70B” is not a complete specification. Take architecture values from the config file and from the checkpoint that is actually loaded.
Step 1: estimate raw weight memory
raw weight bytes = parameter count × stored bits per parameter ÷ 8
Illustrative raw payloads for 7 billion parameters are:
| Stored representation | Ideal raw weight payload |
|---|---|
| 32 bits | 28 GB |
| 16 bits | 14 GB |
| 8 bits | 7 GB |
| 4 bits | 3.5 GB |
These are decimal gigabytes and exclude every other allocation. Quantized checkpoints also need scales, possible zero points, packing metadata, and layers retained at another precision. The loaded runtime representation can differ from the file size.
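As a quick check on the table, the formula can be sketched in Python. The function name `raw_weight_bytes` and the unit constants are illustrative and not part of any library, and the result is still only the ideal payload.

```python
def raw_weight_bytes(parameter_count: int, stored_bits: float) -> float:
    """Ideal weight payload in bytes: parameters * bits per parameter / 8."""
    return parameter_count * stored_bits / 8


GB = 10**9   # decimal gigabyte, as used in the table
GIB = 2**30  # binary gibibyte, which many monitoring tools report

for bits in (32, 16, 8, 4):
    payload = raw_weight_bytes(7_000_000_000, bits)
    print(f"{bits:>2} bits: {payload / GB:5.1f} GB = {payload / GIB:5.2f} GiB")
```

The GiB column is included because a tool that reports binary units will show 14 GB of weights as roughly 13.04 GiB, which is easy to mistake for missing memory.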
For mixture-of-experts models, distinguish total resident parameters from parameters activated per token. Active parameters help explain compute, but memory planning generally starts from all weights resident on the device or device group. Offloading and expert parallelism can change placement.
Read what LLM quantization is before converting a bit label into a memory assumption.
Step 2: add quantization overhead and unquantized modules
A practical weight estimate has this shape:
loaded weight memory = packed weights + scales + zero points + higher-precision modules + runtime representation overhead
Do not apply a universal percentage without measuring the format. Group size, metadata representation, model architecture, and runtime affect overhead.
When a framework exposes a loaded memory-footprint function, record that result after model initialization. Also inspect per-device allocations because automatic placement may leave one accelerator much fuller than another.
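To see why group size matters, here is a minimal sketch of the weight estimate for a group-quantized checkpoint. It assumes one scale, and optionally one zero point, per group of weights across a single flat pool of parameters. The function name, the default bit widths, and that layout are assumptions for illustration; real formats and runtimes differ, so the output is a starting point for measurement and not a prediction.

```python
import math


def grouped_quant_bytes(params, weight_bits, group_size,
                        scale_bits=16, zero_point_bits=0,
                        higher_precision_bytes=0):
    """Rough loaded-weight estimate for a group-quantized checkpoint.

    Assumes one scale (and optionally one zero point) per group of
    `group_size` weights. Runtime representation overhead is not modeled.
    """
    groups = math.ceil(params / group_size)
    parts = {
        "packed_weights": params * weight_bits / 8,
        "scales": groups * scale_bits / 8,
        "zero_points": groups * zero_point_bits / 8,
        "higher_precision_modules": higher_precision_bytes,
    }
    parts["total"] = sum(parts.values())
    return parts


params = 7_000_000_000
for group_size in (32, 128):
    est = grouped_quant_bytes(params, 4, group_size, zero_point_bits=4)
    effective_bits = est["total"] * 8 / params
    print(f"group size {group_size}: {est['total'] / 1e9:.2f} GB, "
          f"{effective_bits:.3f} effective bits per parameter")
```

Under these assumptions the same "4-bit" label costs about 4.16 bits per parameter at group size 128 and about 4.63 at group size 32, before any layers kept at higher precision are added.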
Step 3: calculate KV-cache memory
For many decoder-only transformers, a useful conceptual estimate is:
KV bytes = layers × 2 × KV heads × head dimension × bytes per cache value × cached tokens × concurrent sequences
The factor of two represents keys and values.
head dimension = hidden size ÷ attention heads
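This relation holds for many architectures, but some configs specify the head dimension directly; prefer that value when it is present. Use the model's key-value head count, not automatically the attention-head count. Multi-head attention, grouped-query attention, and multi-query attention store different numbers of KV heads.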
“Cached tokens” should include the tokens retained for each active sequence, including prompt tokens and generated tokens so far. A scheduler with variable sequence lengths will have a distribution rather than one fixed number.
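The effect of the key-value head count is easiest to see per cached token. The sketch below applies the formula above to a hypothetical architecture (32 layers, hidden size 4096, 32 attention heads, two bytes per cache value); the function name and all values are assumptions chosen for illustration.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV-cache payload for one cached token of one sequence."""
    return layers * 2 * kv_heads * head_dim * bytes_per_value  # 2 = keys and values


# Hypothetical architecture, not a named model.
layers, hidden_size, attention_heads = 32, 4096, 32
head_dim = hidden_size // attention_heads  # 128

# Same attention-head count, different numbers of stored KV heads.
variants = {"multi-head": 32, "grouped-query": 8, "multi-query": 1}
per_token = {
    name: kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2)
    for name, kv_heads in variants.items()
}

for name, size in per_token.items():
    print(f"{name:>13}: {size // 1024} KiB per cached token")
```

With these values the cache costs 512 KiB, 128 KiB, and 16 KiB per token respectively, so plugging in the attention-head count for a grouped-query model would overestimate the cache fourfold.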
Worked KV-cache example
Consider an illustrative architecture with:
- 32 layers
- 8 key-value heads
- Head dimension 128
- Two bytes per KV value
- 8,192 cached tokens per sequence
- Four concurrent sequences
32 × 2 × 8 × 128 × 2 × 8,192 × 4 = 4,294,967,296 bytes
That is 4 GiB of KV-cache payload across the four sequences, or about 1 GiB per sequence. The example is not tied to a named model. Block allocation, metadata, reserved cache pools, and runtime implementation can change measured memory.
This calculation shows why context and concurrency must be multiplied together. A model can fit for one request and fail when several long requests become active.
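The same arithmetic can be turned around to ask how many sequences fit in a given cache budget. This sketch reuses the illustrative architecture from the worked example; the 10 GiB budget and the helper names are assumptions, and a real runtime's block allocation will lower the usable count.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_value,
                   cached_tokens, sequences):
    """KV-cache payload for `sequences` sequences of `cached_tokens` each."""
    return (layers * 2 * kv_heads * head_dim * bytes_per_value
            * cached_tokens * sequences)


def max_sequences(kv_budget_bytes, cached_tokens, **arch):
    """Whole sequences of a given length that fit in the KV budget."""
    per_sequence = kv_cache_bytes(cached_tokens=cached_tokens, sequences=1, **arch)
    return kv_budget_bytes // per_sequence


GIB = 2**30
arch = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
kv_budget = 10 * GIB  # hypothetical memory left over for the KV cache

worked_example = kv_cache_bytes(cached_tokens=8192, sequences=4, **arch)
print(f"worked example: {worked_example:,} bytes")

fits = {tokens: max_sequences(kv_budget, tokens, **arch)
        for tokens in (2048, 8192, 32768)}
for tokens, count in fits.items():
    print(f"{tokens:>6} cached tokens per sequence -> {count} sequences")
```

Under these assumptions the budget holds 40 short sequences but only 2 at 32,768 tokens, which is the failure mode described above.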
Step 4: add activations and temporary workspaces
Inference does not retain the full backward-pass state required for training, but it still allocates intermediate tensors. Memory can be used by:
- Layer activations
- Attention operations
- Matrix-multiplication workspaces
- Logits and sampling buffers
- Prefill batches
- Compilation or graph-capture buffers
- Tensor-parallel communication buffers
- Tokenization and request metadata
Peak workspace depends on sequence length, batch composition, kernels, and runtime. Long-prompt prefill can create a different peak from decoding.
The reliable method is to measure peak allocated and reserved memory while running representative warmup and stress workloads.
Step 5: include runtime and allocator overhead
The framework, accelerator context, memory allocator, loaded libraries, and fragmentation consume capacity. Some runtimes reserve a large memory pool for KV blocks or future allocations. Reserved memory is not necessarily wasted, but it reduces what remains available to other processes.
Record memory at these points:
- Memory immediately after process startup
- Memory after model loading
- Memory after warmup
- Peak during long prefill
- Peak during concurrent decoding
- Allocated versus reserved memory when available
Subtracting these snapshots helps identify each category.
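A minimal sketch of that subtraction follows. The snapshot values are hypothetical numbers in MiB and the category labels are approximate: for example, the difference between startup and post-load memory can include more than weights.

```python
def attribute_memory(snapshots_mib):
    """Split per-device memory snapshots into rough categories by subtraction."""
    s = snapshots_mib
    return {
        "runtime and context": s["process_start"],
        "loaded weights": s["model_loaded"] - s["process_start"],
        "warmup buffers and pools": s["after_warmup"] - s["model_loaded"],
        # Both peaks are compared against the warm state, not against each other.
        "prefill peak above warm state": s["peak_prefill"] - s["after_warmup"],
        "decode peak above warm state": s["peak_decode"] - s["after_warmup"],
    }


# Hypothetical measurements for one device, in MiB.
snapshots = {
    "process_start": 600,
    "model_loaded": 14_200,
    "after_warmup": 15_100,
    "peak_prefill": 18_400,
    "peak_decode": 21_000,
}

breakdown = attribute_memory(snapshots)
for category, mib in breakdown.items():
    print(f"{category:<32} {mib:>7,} MiB")
```

Keeping the breakdown per device and per run makes it easier to see which category moved after a runtime upgrade.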
Step 6: add adapters and auxiliary models
A deployed AI system may load more than one checkpoint:
- LoRA or other adapters
- Embedding models
- Rerankers
- Draft models for speculative decoding
- Safety classifiers
- Vision encoders
- Audio encoders or decoders
They may share a device or run on separate devices. Include their weights, caches, and workspaces in the per-device budget.
A model router that selects among several fully resident models can require much more memory than a router that loads one model at a time.
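The difference between the two router designs can be sketched as a sum versus a maximum. The model names and footprints below are hypothetical, and the sketch covers loaded footprints only: it ignores caches, workspaces, and the latency cost of swapping models in and out.

```python
def router_weight_budget(model_footprints_gb, fully_resident):
    """Footprint needed to serve a set of routed models.

    fully_resident=True  -> every model stays loaded (sum of footprints)
    fully_resident=False -> one model loaded at a time (largest footprint)
    """
    sizes = list(model_footprints_gb.values())
    return sum(sizes) if fully_resident else max(sizes)


# Hypothetical loaded footprints in GB.
models_gb = {"small-chat": 8, "large-chat": 40, "code": 16}

resident = router_weight_budget(models_gb, fully_resident=True)
swapped = router_weight_budget(models_gb, fully_resident=False)
print(f"all resident: {resident} GB, one at a time: {swapped} GB")
```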
Step 7: calculate per-device placement
For multiple GPUs, total memory is not enough. Estimate and measure each device:
device memory = local weights + local KV cache + local workspaces + communication buffers + runtime overhead
Tensor parallelism can split tensors, pipeline parallelism can place layers on different devices, and expert parallelism can distribute experts. The split may be uneven. Embedding layers, output heads, and buffers can create a larger allocation on one device.
Do not divide the total estimate by GPU count unless the runtime documentation and measurements show an even distribution.
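The sketch below shows how an average can hide an overloaded device. The `DeviceBudget` class, the 30 GiB usable capacity, and every number are hypothetical; the fields simply mirror the terms of the per-device formula above.

```python
from dataclasses import dataclass


@dataclass
class DeviceBudget:
    """Per-device memory terms, all in GiB."""
    local_weights: float
    local_kv_cache: float
    local_workspaces: float
    communication_buffers: float
    runtime_overhead: float

    def total(self) -> float:
        return (self.local_weights + self.local_kv_cache + self.local_workspaces
                + self.communication_buffers + self.runtime_overhead)


capacity_gib = 30.0  # hypothetical usable capacity per device

devices = {
    # gpu0 is assumed to also hold the embedding layer and output head.
    "gpu0": DeviceBudget(19.5, 8.0, 2.0, 0.5, 1.5),
    "gpu1": DeviceBudget(16.5, 8.0, 1.5, 0.5, 1.5),
}

average = sum(d.total() for d in devices.values()) / len(devices)
over_budget = [name for name, d in devices.items() if d.total() > capacity_gib]

print(f"average per device: {average:.2f} GiB (capacity {capacity_gib} GiB)")
print(f"devices over budget: {over_budget}")
```

Here the average of 29.75 GiB looks safe, yet gpu0 needs 31.5 GiB and would fail.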
Step 8: reserve output and concurrency capacity
Build memory scenarios from the service objective:
| Scenario | Input | Output allowance | Concurrency | Purpose |
|---|---|---|---|---|
| Median | measured median | normal response | steady state | ordinary capacity |
| Tail | 95th percentile | high percentile | peak | user experience |
| Guardrail | maximum accepted | configured maximum | controlled | admission policy |
A maximum context window is not the same as a sensible production default. Admission control can reject, queue, or trim requests before they cause out-of-memory failures.
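One way to picture admission control is as a token-budget check made before a request is scheduled. The sketch below is a hypothetical policy, not any runtime's scheduler: the function name, the thresholds, and the idea of reserving the full output allowance up front are all assumptions.

```python
def admission_decision(input_tokens, output_allowance, reserved_tokens,
                       token_budget, max_request_tokens):
    """Decide what to do with a request before it can exhaust the KV cache.

    reserved_tokens: tokens already reserved by active sequences
    token_budget: cached tokens the KV budget can hold in total
    max_request_tokens: guardrail on input plus output for one request
    """
    needed = input_tokens + output_allowance
    if needed > max_request_tokens:
        return "reject"  # a service could trim the input here instead
    if reserved_tokens + needed > token_budget:
        return "queue"   # would fit alone, but not alongside current load
    return "admit"


# Hypothetical limits: 81,920 cached tokens in total, 16,384 per request.
limits = dict(token_budget=81_920, max_request_tokens=16_384)

decisions = [
    admission_decision(3_000, 1_000, reserved_tokens=40_000, **limits),
    admission_decision(6_000, 2_000, reserved_tokens=78_000, **limits),
    admission_decision(20_000, 1_000, reserved_tokens=0, **limits),
]
print(decisions)
```

Reserving the whole output allowance is conservative; a scheduler that can preempt or evict sequences may admit more aggressively.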
The VRAM definition explains why “fits” and “serves reliably” are separate questions.
Step 9: apply measured headroom
After measuring the worst accepted workload:
required capacity = measured peak × (1 + safety margin)
The margin should reflect observed variance, runtime upgrades, workload changes, and failure policy. There is no universal correct percentage.
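As a small worked instance of the formula, assume a measured peak of 20 GiB and a margin of 25%. Both numbers are arbitrary examples chosen to keep the arithmetic simple, and the margin is not a recommendation.

```python
def required_capacity(measured_peak, safety_margin):
    """required capacity = measured peak * (1 + safety margin)"""
    return measured_peak * (1 + safety_margin)


measured_peak_gib = 20.0  # hypothetical worst accepted workload
safety_margin = 0.25      # arbitrary example value

needed = required_capacity(measured_peak_gib, safety_margin)
for capacity_gib in (24.0, 32.0):
    verdict = "fits" if needed <= capacity_gib else "does not fit"
    print(f"need {needed} GiB, device has {capacity_gib} GiB: {verdict}")
```

The measured peak alone would fit the 24 GiB device; the margin is what moves the decision to the larger one.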
Headroom also affects economics. Reserving capacity improves reliability but reduces maximum utilization. Use the deployable throughput, not theoretical saturation throughput, when calculating self-hosted cost.
Use the open-model token cost calculator after the memory-safe concurrency and throughput have been benchmarked.
A complete inference-memory equation
total required memory = loaded weights + KV cache + peak activations/workspaces + auxiliary models + runtime/allocator overhead + safety margin
Keep two versions of this total:
- A formula-based estimate used before hardware selection
- A measured result from the exact deployed stack
When they differ, update the assumptions behind the estimate rather than hiding the difference in a generic multiplier.
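Comparing the estimate and the measurement term by term shows which assumption to revisit. In this sketch all values are hypothetical MiB figures, the helper names are illustrative, and the safety margin is applied as a fraction of the subtotal, which is an assumption made here to match the multiplicative form in Step 9.

```python
def total_required(components, safety_margin):
    """Sum the memory terms, then add the margin as a fraction of the subtotal."""
    subtotal = sum(components.values())
    return subtotal * (1 + safety_margin)


def largest_gaps(estimated, measured):
    """Per-component (measured - estimated), largest absolute gap first."""
    gaps = {name: measured[name] - estimated[name] for name in estimated}
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical values in MiB.
estimated = {
    "loaded weights": 13_600,
    "KV cache": 4_096,
    "activations/workspaces": 2_000,
    "auxiliary models": 1_200,
    "runtime/allocator overhead": 1_000,
}
measured = {
    "loaded weights": 13_900,
    "KV cache": 5_120,
    "activations/workspaces": 3_300,
    "auxiliary models": 1_200,
    "runtime/allocator overhead": 900,
}

print(total_required(estimated, 0.25), total_required(measured, 0.25))
gaps = largest_gaps(estimated, measured)
for name, gap in gaps:
    print(f"{name:<28} {gap:+,} MiB")
```

In this made-up case the workspace and KV-cache terms account for most of the gap, so those formulas are the ones to correct.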
Training memory is a different calculation
Training and full fine-tuning can require gradients, optimizer states, master weights, and saved activations for backward propagation. Mixed-precision Adam training can use many bytes per parameter before activation memory is counted.
Do not reuse an inference-fit calculation for training. Parameter-efficient fine-tuning reduces trainable state but still has its own activation and optimizer budget.
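To make "many bytes per parameter" concrete, the sketch below uses a commonly cited breakdown for mixed-precision Adam: 16-bit weights and gradients plus 32-bit master weights and two 32-bit optimizer moments. That breakdown is an assumption about one typical setup; frameworks, optimizers, and sharding strategies change it, and activation memory is excluded.

```python
# Assumed per-parameter state for one common mixed-precision Adam setup.
TRAINING_BYTES_PER_PARAM = {
    "16-bit weights": 2,
    "16-bit gradients": 2,
    "32-bit master weights": 4,
    "Adam first moment (32-bit)": 4,
    "Adam second moment (32-bit)": 4,
}


def training_state_bytes(parameter_count):
    """Weights, gradients, and optimizer state; activations not included."""
    return parameter_count * sum(TRAINING_BYTES_PER_PARAM.values())


def inference_weight_bytes(parameter_count, stored_bits=16):
    """Ideal inference weight payload, for comparison."""
    return parameter_count * stored_bits // 8


params = 7_000_000_000
training = training_state_bytes(params)
inference = inference_weight_bytes(params)
print(f"training state: {training / 1e9:.0f} GB")
print(f"16-bit inference weights: {inference / 1e9:.0f} GB")
print(f"ratio: {training / inference:.0f}x")
```

Under these assumptions, 7 billion parameters need about 112 GB of training state against 14 GB of 16-bit inference weights.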
Validation procedure
1. Load the exact checkpoint and quantization.
2. Record memory after loading.
3. Warm the selected kernels.
4. Run median input and output lengths.
5. Run tail sequence lengths.
6. Increase concurrency to the service target.
7. Exercise adapters and auxiliary models.
8. Record per-device peak allocated and reserved memory.
9. Verify latency and throughput at the safe point.
10. Repeat after runtime or model upgrades.
The GPU VRAM fit tool can narrow hardware choices before this benchmark.
Frequently asked questions
How much VRAM does a 7B model need?
There is no single answer. Raw weight memory depends on precision, while total serving memory also depends on KV cache, context, concurrency, runtime buffers, and overhead.
How do I calculate KV-cache memory?
Use the model's layer count, key-value head count, head dimension, cache data type, active cached tokens, and concurrent sequences. Then verify against the runtime's measured allocation.
Can system RAM replace GPU VRAM?
Some runtimes support CPU offload or unified-memory execution, but transfers and lower bandwidth can reduce performance. A model running is not proof that it meets latency requirements.
Should I use checkpoint file size as the memory estimate?
No. File size can exclude runtime conversion, KV cache, activations, workspaces, allocator reservations, and other loaded components.
Cite this page
How to Calculate LLM Memory and VRAM Requirements for Inference. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/how-to-calculate-llm-memory-requirements/