How to Calculate LLM Memory and VRAM Requirements for Inference

Last updated 2026-06-21 · ByteCosts

Direct answer

To estimate LLM inference memory, add loaded weight memory, KV-cache memory for the planned tokens and concurrent sequences, peak activations and workspaces, runtime overhead, and a measured safety margin. Parameter count alone estimates only the weight component. The final capacity decision should be verified by measuring per-device peak memory under representative long-context and concurrency workloads.

Start with the deployment, not the model name

Memory requirements depend on the exact checkpoint and serving configuration. Record:

- Total resident parameter count
- Architecture type, including dense or mixture of experts
- Stored weight precision or quantization method
- Compute data type
- Number of layers
- Hidden size
- Number of attention heads
- Number of key-value heads
- KV-cache data type
- Input and output sequence distribution
- Maximum concurrent sequences
- Runtime, kernels, and parallelism strategy
- Adapters or auxiliary models loaded with the base model

A family name such as “7B” or “70B” is not a complete specification. Config files and the actual loaded checkpoint are the source for architecture values.

Step 1: estimate raw weight memory

raw weight bytes = parameter count × stored bits per parameter ÷ 8

Illustrative raw payloads for 7 billion parameters are:

Stored representation	Ideal raw weight payload
32 bits	28 GB
16 bits	14 GB
8 bits	7 GB
4 bits	3.5 GB

These are decimal gigabytes and exclude every other allocation. Quantized checkpoints also need scales, possible zero points, packing metadata, and layers retained at another precision. The loaded runtime representation can differ from the file size.
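The raw-payload formula can be checked with a few lines of Python. This is a minimal sketch: the function name `raw_weight_bytes` and the 7 billion parameter count are illustrative, and the result covers only the ideal packed payload, not scales, metadata, or any runtime allocation.

```python
def raw_weight_bytes(parameter_count: int, stored_bits: float) -> float:
    """Ideal packed payload only: parameter count x stored bits / 8."""
    return parameter_count * stored_bits / 8


PARAMS = 7_000_000_000  # illustrative 7 billion parameters

for bits in (32, 16, 8, 4):
    payload = raw_weight_bytes(PARAMS, bits)
    gb = payload / 1e9        # decimal gigabytes, as in the table
    gib = payload / 2**30     # binary gibibytes, as many tools report
    print(f"{bits:>2} bits: {gb:5.1f} GB ({gib:5.2f} GiB)")
```

Printing both units is deliberate: a device advertised in GB and a monitoring tool that reports GiB can differ by several percent for the same allocation.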

For mixture-of-experts models, distinguish total resident parameters from parameters activated per token. Active parameters help explain compute, but memory planning generally starts from all weights resident on the device or device group. Offloading and expert parallelism can change placement.

Read what LLM quantization is before converting a bit label into a memory assumption.

Step 2: add quantization overhead and unquantized modules

A practical weight estimate has this shape:

loaded weight memory = packed weights + scales + zero points + higher-precision modules + runtime representation overhead

Do not apply a universal percentage without measuring the format. Group size, metadata representation, model architecture, and runtime affect overhead.
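To see how group size and metadata change the estimate, the sketch below itemizes a group-quantized checkpoint. Everything specific in it is an assumption for illustration: the function name, the split of 6.5 billion quantized and 0.5 billion higher-precision parameters, a group size of 128, 16-bit scales, and 4-bit zero points. It omits the runtime representation overhead term, which has to be measured.

```python
import math


def loaded_weight_estimate(n_quantized, weight_bits, group_size,
                           scale_bits=16, zero_point_bits=0,
                           n_higher_precision=0, higher_precision_bits=16):
    """Itemized byte estimate; one scale (and optional zero point) per group."""
    groups = math.ceil(n_quantized / group_size)
    return {
        "packed_weights": n_quantized * weight_bits / 8,
        "scales": groups * scale_bits / 8,
        "zero_points": groups * zero_point_bits / 8,
        "higher_precision_modules": n_higher_precision * higher_precision_bits / 8,
    }


# Hypothetical 7B-class checkpoint: most weights at 4 bits, some kept at 16 bits.
parts = loaded_weight_estimate(
    n_quantized=6_500_000_000, weight_bits=4, group_size=128,
    scale_bits=16, zero_point_bits=4,
    n_higher_precision=500_000_000, higher_precision_bits=16,
)
total = sum(parts.values())

for name, size in parts.items():
    print(f"{name:>26}: {size / 1e9:6.3f} GB")
print(f"{'total':>26}: {total / 1e9:6.3f} GB")
```

Under these assumptions the total lands well above the 3.5 GB ideal payload for 4-bit weights, mostly because of the modules kept at higher precision. A different format or group size would shift the numbers, which is why the result should be compared against the loaded footprint.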

When a framework exposes a loaded memory-footprint function, record that result after model initialization. Also inspect per-device allocations because automatic placement may leave one accelerator much fuller than another.

Step 3: calculate KV-cache memory

For many decoder-only transformers, a useful conceptual estimate is:

KV bytes = layers × 2 × KV heads × head dimension × bytes per cache value × cached tokens × concurrent sequences

The factor of two represents keys and values. Unless the model config sets the head dimension explicitly, it is usually:

head dimension = hidden size ÷ attention heads

Use the model's key-value head count, not automatically the attention-head count. Multi-head attention, grouped-query attention, and multi-query attention store different numbers of KV heads.

“Cached tokens” should include the tokens retained for each active sequence, including prompt tokens and generated tokens so far. A scheduler with variable sequence lengths will have a distribution rather than one fixed number.
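Because active sequences rarely share one length, it can help to sum cached tokens per sequence instead of multiplying by a single figure. The sketch below does that. The function name and the example architecture values (32 layers, 128-dimensional heads, two bytes per value) are illustrative assumptions, and the result is payload only, without block rounding or runtime metadata.

```python
from typing import Iterable


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int,
                   cached_tokens_per_sequence: Iterable[int]) -> int:
    """KV payload for a set of active sequences with different lengths."""
    # Bytes stored for one token across all layers; the 2 covers keys and values.
    per_token = layers * 2 * kv_heads * head_dim * bytes_per_value
    return per_token * sum(cached_tokens_per_sequence)


# Hypothetical snapshot of a scheduler: prompt plus generated tokens so far.
active = [512, 2048, 8192, 300]

for label, kv_heads in (("multi-head (32 KV heads)", 32),
                        ("grouped-query (8 KV heads)", 8),
                        ("multi-query (1 KV head)", 1)):
    size = kv_cache_bytes(32, kv_heads, 128, 2, active)
    print(f"{label:>28}: {size / 2**30:.2f} GiB")
```

The loop keeps everything fixed except the KV head count, which makes the effect of the attention variant on cache size visible directly.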

Worked KV-cache example

Consider an illustrative architecture with:

- 32 layers
- 8 key-value heads
- Head dimension 128
- Two bytes per KV value
- 8,192 cached tokens per sequence
- Four concurrent sequences

32 × 2 × 8 × 128 × 2 × 8,192 × 4 = 4,294,967,296 bytes

That is 4 GiB of KV-cache payload across the four sequences, or about 1 GiB per sequence. The example is not tied to a named model. Block allocation, metadata, reserved cache pools, and runtime implementation can change measured memory.

This calculation shows why context and concurrency must be multiplied together. A model can fit for one request and fail when several long requests become active.
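The same arithmetic can be turned around to ask how many sequences a given cache budget supports. This is a rough sketch using the illustrative architecture from the worked example. The 10 GiB budget and both function names are assumptions, and a real scheduler would lose some of the budget to block rounding and metadata.

```python
def kv_bytes_per_sequence(layers: int, kv_heads: int, head_dim: int,
                          bytes_per_value: int, cached_tokens: int) -> int:
    """KV payload for one sequence holding `cached_tokens` tokens."""
    return layers * 2 * kv_heads * head_dim * bytes_per_value * cached_tokens


def max_concurrent_sequences(kv_budget_bytes: int, layers: int, kv_heads: int,
                             head_dim: int, bytes_per_value: int,
                             cached_tokens: int) -> int:
    """Whole sequences that fit in the budget, ignoring allocator rounding."""
    per_sequence = kv_bytes_per_sequence(
        layers, kv_heads, head_dim, bytes_per_value, cached_tokens)
    return kv_budget_bytes // per_sequence


KV_BUDGET = 10 * 2**30  # hypothetical 10 GiB left for the cache after weights

for tokens in (2048, 8192, 32768):
    fit = max_concurrent_sequences(KV_BUDGET, 32, 8, 128, 2, tokens)
    print(f"{tokens:>6} cached tokens per sequence -> {fit} concurrent sequences")
```

Quadrupling the cached tokens per sequence cuts the supported concurrency to a quarter, which is the multiplication the text describes.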

Step 4: add activations and temporary workspaces

Inference does not retain the full backward-pass state required for training, but it still allocates intermediate tensors. Memory can be used by:

- Layer activations
- Attention operations
- Matrix-multiplication workspaces
- Logits and sampling buffers
- Prefill batches
- Compilation or graph-capture buffers
- Tensor-parallel communication buffers
- Tokenization and request metadata

Peak workspace depends on sequence length, batch composition, kernels, and runtime. Long-prompt prefill can create a different peak from decoding.

The reliable method is to measure peak allocated and reserved memory while running representative warmup and stress workloads.

Step 5: include runtime and allocator overhead

The framework, accelerator context, memory allocator, loaded libraries, and fragmentation consume capacity. Some runtimes reserve a large memory pool for KV blocks or future allocations. Reserved memory is not necessarily wasted, but it reduces what remains available to other processes.

Record memory at each of these points:

- Memory immediately after process startup
- Memory after model loading
- Memory after warmup
- Peak during long prefill
- Peak during concurrent decoding
- Allocated versus reserved memory when available

Subtracting these snapshots helps identify each category.
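A small sketch of the subtraction, assuming the snapshots have already been collected from whatever monitoring the runtime offers. The readings are invented MiB values and the category names are illustrative. The attribution is approximate, since a runtime that reserves a cache pool up front will show that pool in an earlier snapshot.

```python
# Hypothetical readings in MiB, taken at each stage on one device.
snapshots = {
    "startup": 450,
    "model_loaded": 14_200,
    "after_warmup": 15_100,
    "peak_prefill": 19_800,
    "peak_decode": 21_300,
}


def attribute(s: dict) -> dict:
    """Difference consecutive snapshots to get a rough per-category figure."""
    return {
        "runtime_context": s["startup"],
        "loaded_weights": s["model_loaded"] - s["startup"],
        "warmup_buffers": s["after_warmup"] - s["model_loaded"],
        "prefill_peak_above_warm": s["peak_prefill"] - s["after_warmup"],
        "decode_peak_above_warm": s["peak_decode"] - s["after_warmup"],
    }


breakdown = attribute(snapshots)
for category, mib in breakdown.items():
    print(f"{category:>24}: {mib:>6} MiB")
```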

Step 6: add adapters and auxiliary models

A deployed AI system may load more than one checkpoint:

- LoRA or other adapters
- Embedding models
- Rerankers
- Draft models for speculative decoding
- Safety classifiers
- Vision encoders
- Audio encoders or decoders

They may share a device or run on separate devices. Include their weights, caches, and workspaces in the per-device budget.

A model router that selects among several fully resident models can require much more memory than a router that loads one model at a time.
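The difference between the two router designs is a sum versus a maximum. The sketch below uses invented weight sizes and a hypothetical function name. It counts weights only, so caches and workspaces for each resident model would come on top.

```python
# Hypothetical loaded weight sizes in GB for models behind one router.
models_gb = {"general": 14.0, "code": 14.0, "small": 3.5}


def resident_weight_gb(models: dict, all_resident: bool) -> float:
    """Weights held in memory: every model at once, or only the largest one."""
    if all_resident:
        return sum(models.values())
    return max(models.values())


print("all models resident:", resident_weight_gb(models_gb, True), "GB")
print("one model at a time:", resident_weight_gb(models_gb, False), "GB")
```

Loading one model at a time trades memory for load latency on every switch, so the cheaper figure is not automatically the better design.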

Step 7: calculate per-device placement

For multiple GPUs, total memory is not enough. Estimate and measure each device:

device memory = local weights + local KV cache + local workspaces + communication buffers + runtime overhead

Tensor parallelism can split tensors, pipeline parallelism can place layers on different devices, and expert parallelism can distribute experts. The split may be uneven. Embedding layers, output heads, and buffers can create a larger allocation on one device.

Do not divide the total estimate by GPU count unless the runtime documentation and measurements show an even distribution.
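The sketch below shows why the average can mislead. It budgets two devices separately using the terms of the device-memory formula. The figures, the 15.5 GiB usable capacity, and the class name are all hypothetical. Device 0 is assumed to carry the embedding layer and output head.

```python
from dataclasses import dataclass


@dataclass
class DeviceBudget:
    """Per-device estimate in GiB, one field per term of the formula."""
    weights: float
    kv_cache: float
    workspaces: float
    comm_buffers: float
    runtime: float

    def total(self) -> float:
        return (self.weights + self.kv_cache + self.workspaces
                + self.comm_buffers + self.runtime)


devices = [
    DeviceBudget(weights=8.0, kv_cache=5.0, workspaces=1.5,
                 comm_buffers=0.5, runtime=1.0),  # also holds embeddings and head
    DeviceBudget(weights=7.0, kv_cache=5.0, workspaces=1.0,
                 comm_buffers=0.5, runtime=1.0),
]
USABLE_GIB = 15.5  # hypothetical usable capacity per device

totals = [d.total() for d in devices]
average = sum(totals) / len(totals)
fullest = max(range(len(totals)), key=lambda i: totals[i])

print(f"average per device: {average:.2f} GiB (fits: {average <= USABLE_GIB})")
for i, t in enumerate(totals):
    print(f"device {i}: {t:.2f} GiB (fits: {t <= USABLE_GIB})")
print(f"fullest device: {fullest}")
```

With these numbers the average fits but device 0 does not, so the placement fails even though the total would suggest otherwise.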

Step 8: reserve output and concurrency capacity

Build memory scenarios from the service objective:

Scenario	Input	Output allowance	Concurrency	Purpose
Median	measured median	normal response	steady state	ordinary capacity
Tail	95th percentile	high percentile	peak	user experience
Guardrail	maximum accepted	configured maximum	controlled	admission policy

A maximum context window is not the same as a sensible production default. Admission control can reject, queue, or trim requests before they cause out-of-memory failures.
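One way to picture admission control is a token budget derived from the memory left for the KV cache. The sketch below is a hypothetical policy, not a description of any particular runtime's scheduler. The function name, the budget of 32,768 cached tokens, and the per-request cap of 8,192 tokens are assumptions. Trimming is left out for brevity.

```python
def admit(input_tokens: int, max_output_tokens: int, tokens_in_use: int,
          token_budget: int, max_tokens_per_request: int) -> str:
    """Decide what to do with a request before it allocates any cache."""
    # Budget for the worst case: the full prompt plus the full output allowance.
    needed = input_tokens + max_output_tokens
    if needed > max_tokens_per_request:
        return "reject"   # exceeds the guardrail scenario
    if tokens_in_use + needed > token_budget:
        return "queue"    # would fit alone, but not alongside current load
    return "admit"


BUDGET = 32_768      # hypothetical cached-token capacity of the KV pool
PER_REQUEST = 8_192  # hypothetical maximum accepted input plus output

print(admit(1_000, 500, tokens_in_use=0,
            token_budget=BUDGET, max_tokens_per_request=PER_REQUEST))
print(admit(8_000, 500, tokens_in_use=0,
            token_budget=BUDGET, max_tokens_per_request=PER_REQUEST))
print(admit(4_000, 1_000, tokens_in_use=30_000,
            token_budget=BUDGET, max_tokens_per_request=PER_REQUEST))
```

Reserving the full output allowance up front is conservative. A policy that reserves less admits more requests but has to handle the case where generation outgrows the reservation.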

The VRAM definition explains why “fits” and “serves reliably” are separate questions.

Step 9: apply measured headroom

After measuring the worst accepted workload:

required capacity = measured peak × (1 + safety margin)

The margin should reflect observed variance, runtime upgrades, workload changes, and failure policy. There is no universal correct percentage.
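As a quick illustration of the headroom formula, the sketch below applies a margin to a measured peak and checks the result against a device. The 20,000 MiB peak, the 25% margin, and the 24,576 MiB device are placeholder values, not recommendations.

```python
def required_capacity(measured_peak: float, safety_margin: float) -> float:
    """measured peak x (1 + safety margin), in the same unit as the peak."""
    return measured_peak * (1 + safety_margin)


MEASURED_PEAK_MIB = 20_000   # hypothetical peak under the worst accepted workload
SAFETY_MARGIN = 0.25         # placeholder; derive yours from observed variance
DEVICE_MIB = 24_576          # hypothetical 24 GiB device

needed = required_capacity(MEASURED_PEAK_MIB, SAFETY_MARGIN)
print(f"required: {needed:.0f} MiB, device: {DEVICE_MIB} MiB, "
      f"fits: {needed <= DEVICE_MIB}")
```

Here the measured peak fits on the device but the peak plus margin does not, which is exactly the gap between "fits" and "serves reliably".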

Headroom also affects economics. Reserving capacity improves reliability but reduces maximum utilization. Use the deployable throughput, not theoretical saturation throughput, when calculating self-hosted cost.

Use the open-model token cost calculator after the memory-safe concurrency and throughput have been benchmarked.

A complete inference-memory equation

total required memory = loaded weights + KV cache + peak activations/workspaces + auxiliary models + runtime/allocator overhead + safety margin

Keep two versions of this total:

- A formula-based estimate used before hardware selection
- A measured result from the exact deployed stack

When they differ, update the estimate's assumptions rather than hiding the difference in a generic multiplier.
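Comparing the estimate and the measurement component by component shows which assumption to revisit. The sketch below uses invented MiB figures and hypothetical function names. It treats the safety margin as a fraction applied to the summed components.

```python
# Hypothetical figures in MiB for one device.
estimate = {"loaded_weights": 4_400, "kv_cache": 4_096, "workspaces": 1_500,
            "auxiliary_models": 0, "runtime_overhead": 1_000}
measured = {"loaded_weights": 4_550, "kv_cache": 4_608, "workspaces": 2_100,
            "auxiliary_models": 0, "runtime_overhead": 1_250}


def component_gaps(est: dict, meas: dict) -> dict:
    """Measured minus estimated, per component of the equation."""
    return {name: meas[name] - est[name] for name in est}


def total_required(components: dict, safety_margin: float) -> float:
    """Sum of components plus a margin expressed as a fraction of that sum."""
    return sum(components.values()) * (1 + safety_margin)


gaps = component_gaps(estimate, measured)
worst = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:>18}: {gap:+5d} MiB")
print(f"largest gap: {worst}")
print(f"required with 25% margin: {total_required(measured, 0.25):.0f} MiB")
```

In this made-up case the workspace term is furthest off, so the workspace assumption is the one to revise before the next hardware estimate.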

Training memory is a different calculation

Training and full fine-tuning can require gradients, optimizer states, master weights, and saved activations for backward propagation. Mixed-precision Adam training is commonly budgeted at around 16 bytes per parameter (half-precision weights and gradients, FP32 master weights, and two FP32 optimizer moments) before activation memory is counted.

Do not reuse an inference-fit calculation for training. Parameter-efficient fine-tuning reduces trainable state but still has its own activation and optimizer budget.
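To make the gap concrete, the sketch below compares 16-bit inference weights with one common accounting of mixed-precision Adam state. The per-parameter byte breakdown is an assumption about a typical setup, not a property of every optimizer implementation, and it excludes activations entirely. The names are illustrative.

```python
# Assumed bytes per parameter for mixed-precision Adam (activations excluded).
TRAINING_BYTES_PER_PARAM = {
    "half_precision_weights": 2,
    "half_precision_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def training_state_bytes(parameter_count: int) -> int:
    """Weights, gradients, and optimizer state for full fine-tuning."""
    return parameter_count * sum(TRAINING_BYTES_PER_PARAM.values())


def inference_weight_bytes(parameter_count: int, bytes_per_param: int = 2) -> int:
    """Raw weight payload for inference at the given stored precision."""
    return parameter_count * bytes_per_param


PARAMS = 7_000_000_000  # illustrative 7 billion parameters

print(f"inference weights: {inference_weight_bytes(PARAMS) / 1e9:.0f} GB")
print(f"training state:    {training_state_bytes(PARAMS) / 1e9:.0f} GB")
```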

Validation procedure

1. Load the exact checkpoint and quantization.
2. Record memory after loading.
3. Warm the selected kernels.
4. Run median input and output lengths.
5. Run tail sequence lengths.
6. Increase concurrency to the service target.
7. Exercise adapters and auxiliary models.
8. Record per-device peak allocated and reserved memory.
9. Verify latency and throughput at the safe point.
10. Repeat after runtime or model upgrades.

The GPU VRAM fit tool can narrow hardware choices before this benchmark.

Frequently asked questions

How much VRAM does a 7B model need?

There is no single answer. Raw weight memory depends on precision, while total serving memory also depends on KV cache, context, concurrency, runtime buffers, and overhead.

How do I calculate KV-cache memory?

Use the model's layer count, key-value head count, head dimension, cache data type, active cached tokens, and concurrent sequences. Then verify against the runtime's measured allocation.

Can system RAM replace GPU VRAM?

Some runtimes support CPU offload or unified-memory execution, but transfers and lower bandwidth can reduce performance. A model running is not proof that it meets latency requirements.

Should I use checkpoint file size as the memory estimate?

No. File size can exclude runtime conversion, KV cache, activations, workspaces, allocator reservations, and other loaded components.

Cite this page

How to Calculate LLM Memory and VRAM Requirements for Inference. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/how-to-calculate-llm-memory-requirements/
