# What Is LLM Quantization? Bits, Memory, Speed, and Quality

> Canonical: https://bytecosts.com/blog/what-is-model-quantization/ · Last updated 2026-06-21

**Direct answer.** LLM quantization is the process of representing model weights, activations, or both with lower-precision numeric formats. It reduces memory and data movement, and can improve inference efficiency when hardware and kernels support the chosen format, but may change output quality. The exact memory, speed, and quality effects depend on the quantization method, runtime kernels, hardware support, and workload.

**[Apply this concept - Self-host LLM cost per 1M tokens calculator →](https://bytecosts.com/tools/open-model-token-cost/)**

## Precision determines how values are represented

Neural-network parameters and intermediate values are numbers. A full-precision or half-precision format allocates more bits to each value than a lower-precision integer or floating-point format. Quantization maps values from a higher-precision representation into a smaller set of representable values.
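As a rough illustration of that mapping, the sketch below applies symmetric absmax rounding with a single shared scale. This is only one simple scheme chosen for clarity, not the method used by any particular checkpoint format; the function names and sample values are made up for the example.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integer codes using one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def max_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))


# Illustrative values only, not taken from a real model.
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

codes8, scale8 = quantize_absmax(weights, bits=8)
codes4, scale4 = quantize_absmax(weights, bits=4)

err8 = max_error(weights, dequantize(codes8, scale8))
err4 = max_error(weights, dequantize(codes4, scale4))
print("8-bit codes:", codes8, "max error:", err8)
print("4-bit codes:", codes4, "max error:", err4)
```

With fewer bits there are fewer representable codes, so the rounding error on the same values grows.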

For model weights, the raw storage relationship is approximately:

weight bytes = parameter count × bits per stored parameter ÷ 8

A dense model with 10 billion parameters would have an ideal raw weight payload of about 20 GB at 16 bits per parameter, 10 GB at 8 bits, or 5 GB at 4 bits. These are decimal approximations for the weight payload only. A real checkpoint and running process also need scales, metadata, unquantized modules, runtime buffers, KV cache, and other overhead.

Quantization is therefore a way to reduce one major part of the memory budget, not a guarantee that the entire application shrinks by exactly the same ratio.
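The weight-payload arithmetic above can be reproduced in a few lines. This is a minimal sketch of the ideal payload only; `weight_bytes` is a name chosen for the example, and it deliberately ignores scales, metadata, and runtime overhead.

```python
def weight_bytes(parameter_count, bits_per_parameter):
    """Ideal raw weight payload in bytes: parameters x bits / 8."""
    return parameter_count * bits_per_parameter / 8


params = 10_000_000_000  # the 10-billion-parameter dense model from the text

for bits in (16, 8, 4):
    gigabytes = weight_bytes(params, bits) / 1e9  # decimal GB
    print(f"{bits}-bit: {gigabytes:.0f} GB")
# prints 20 GB, 10 GB, and 5 GB
```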

## What can be quantized

Different methods quantize different values:

- Weight-only quantization stores model weights at lower precision while computation may use another data type.
- Weight-and-activation quantization lowers precision for both weights and intermediate activations.
- KV-cache quantization reduces the memory used for attention state during inference.
- Optimizer-state quantization reduces training or fine-tuning memory rather than ordinary inference memory.

A model label such as “4-bit” is incomplete without the method, group size, scale representation, compute type, and runtime. Two 4-bit checkpoints can differ in quality, speed, and memory.
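One reason two "4-bit" checkpoints differ in size is per-group metadata. The sketch below assumes a simplified layout in which every group of weights stores one scale and, optionally, one zero point; real formats vary (some compress the scales further), so treat the defaults as assumptions rather than a description of any specific format.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average stored bits per weight when each group carries its own metadata.

    Assumes one scale (and optionally one zero point) per group of weights.
    """
    return weight_bits + (scale_bits + zero_point_bits) / group_size


print(effective_bits_per_weight(4, group_size=128))                     # 4.125
print(effective_bits_per_weight(4, group_size=32))                      # 4.5
print(effective_bits_per_weight(4, group_size=128, zero_point_bits=4))  # 4.15625
```

Smaller groups usually track the weight distribution more closely but spend more bits on metadata, which is why group size belongs in the checkpoint description.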

## Post-training quantization and quantization-aware training

Post-training quantization (PTQ) converts an already trained model. Many downloadable LLM formats use PTQ because it avoids full retraining. Calibration data may be used to choose scales or identify sensitive weights.
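To make "calibration data may be used to choose scales" concrete, here is a toy search that tries several clipping ranges on sample values and keeps the scale with the lowest mean squared error. It is a generic illustration under simplified assumptions, not the procedure used by GPTQ, AWQ, or any named library; all names and sample values are hypothetical.

```python
def fake_quantize(values, scale, qmax):
    """Quantize then dequantize, so the result can be compared with the input."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def calibrate_scale(samples, bits=4, candidates=20):
    """Pick the clipping range (as a fraction of absmax) that minimizes MSE."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in samples)
    best_scale, best_error = None, float("inf")
    for step in range(1, candidates + 1):
        scale = (max_abs * step / candidates) / qmax
        error = mean_squared_error(samples, fake_quantize(samples, scale, qmax))
        if error < best_error:
            best_scale, best_error = scale, error
    return best_scale, best_error


# Illustrative calibration samples with one larger value.
samples = [0.10, -0.20, 0.15, 0.05, -0.10, 0.12, -0.08, 0.90]

qmax = 2 ** (4 - 1) - 1
absmax_scale = max(abs(v) for v in samples) / qmax
absmax_error = mean_squared_error(samples, fake_quantize(samples, absmax_scale, qmax))
chosen_scale, chosen_error = calibrate_scale(samples, bits=4)
print("absmax error:", absmax_error, "calibrated error:", chosen_error)
```

Because the full absmax range is one of the candidates, the calibrated error on these samples can never be worse than plain absmax; whether it generalizes depends on how representative the calibration data is.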

Quantization-aware training (QAT) simulates or incorporates quantization effects during training so the model can adapt. It can preserve quality in some settings but requires a training process and suitable data.

Methods such as GPTQ and activation-aware weight quantization (AWQ) are examples of approaches designed to reduce LLM weight precision while controlling error. Libraries and runtimes also implement numeric formats such as int8, FP8, and NF4, plus packed checkpoint representations and hardware-specific variants.

## Why quantization can reduce memory

Lower-bit weights require fewer bytes to store and transfer. This can allow a model to fit on a smaller accelerator, leave more room for KV cache and concurrency, or reduce the number of devices needed for one replica.

The theoretical weight saving is straightforward, but deployed memory includes:

- Quantized weight data
- Per-group scales and possible zero points
- Layers retained at higher precision
- Dequantization or compute buffers
- Runtime workspaces
- KV cache
- Activations
- Framework and allocator overhead

Use the LLM memory requirements guide to calculate the complete budget rather than multiplying parameters by one number.
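A small tally makes the point that weights are only one line in the budget. The class and every number below are placeholders for illustration; replace them with values measured on your own deployment.

```python
from dataclasses import dataclass, fields


@dataclass
class MemoryBudgetGB:
    """Deployed-memory components from the list above, in decimal GB."""
    quantized_weights: float
    scales_and_zero_points: float
    higher_precision_layers: float
    dequant_and_compute_buffers: float
    runtime_workspaces: float
    kv_cache: float
    activations: float
    framework_overhead: float

    def total(self):
        return sum(getattr(self, f.name) for f in fields(self))


# Placeholder values, not measurements.
budget = MemoryBudgetGB(
    quantized_weights=5.0,
    scales_and_zero_points=0.2,
    higher_precision_layers=0.5,
    dequant_and_compute_buffers=0.3,
    runtime_workspaces=0.5,
    kv_cache=4.0,
    activations=1.0,
    framework_overhead=1.0,
)

weight_share = budget.quantized_weights / budget.total()
print(f"total: {budget.total():.1f} GB, weights: {weight_share:.0%} of total")
```

In this made-up example the quantized weights are well under half of the total, so halving the weight payload again would shrink the process by far less than half.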

## Why quantization may improve or hurt speed

Quantization can improve speed by reducing memory bandwidth and using faster low-precision hardware operations. It can also make a model slower when the runtime repeatedly dequantizes values, uses an inefficient kernel, transfers data between devices, or lacks native support for the format.

The result depends on:

- GPU, CPU, or accelerator architecture
- Serving runtime and kernel implementation
- Batch size and sequence length
- Whether the workload is compute-bound or memory-bound
- Quantization format and group size
- Tensor parallelism and device placement
- Prefill versus decoding behavior

Do not infer throughput from file size. Benchmark the exact checkpoint, runtime, hardware, input length, output length, and concurrency.
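A benchmark does not need to be elaborate to be more reliable than a guess from file size. The harness below is a minimal sketch: `generate` is a hypothetical callable that takes a prompt and returns the number of output tokens it produced, and `fake_generate` is a stand-in so the example runs without a model. It sends requests sequentially, so it does not measure concurrency.

```python
import time
from statistics import median


def benchmark(generate, prompts, runs=3):
    """Time `generate` over all prompts and report output tokens per second."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = sum(generate(prompt) for prompt in prompts)
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    return {
        "runs": runs,
        "median_tokens_per_s": median(rates),
        "min_tokens_per_s": min(rates),
    }


def fake_generate(prompt):
    """Stand-in for a real runtime call; sleeps briefly and reports 16 tokens."""
    time.sleep(0.001)
    return 16


result = benchmark(fake_generate, ["short prompt", "a somewhat longer prompt"])
print(result)
```

For a real comparison, wrap the actual runtime call, keep prompts and output lengths identical across candidates, and record the checkpoint, runtime version, and hardware next to the numbers.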

## Quality loss is task-dependent

Quantization introduces approximation error. Whether that error matters depends on the model, method, bit width, calibration data, and task. A checkpoint can perform well on broad benchmarks but regress on a narrow production workflow.

Evaluate each candidate on the metrics that matter for your workload:

- Task success rate
- Structured-output validity
- Retrieval or classification accuracy
- Code execution or test pass rate
- Long-context behavior
- Tool-call correctness
- Safety and refusal behavior relevant to the product
- Retry rate

Retry rate belongs in the evaluation because a cheaper first pass that fails more often can increase total cost.
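The effect of retries on cost can be estimated with a simple model. The sketch below assumes each attempt succeeds independently with the same probability and that failed attempts are retried until one succeeds; the costs and success rates are illustrative, not measurements.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected cost of one successful result when failures are retried.

    Assumes independent attempts, so the expected number of attempts is
    1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Illustrative numbers: the quantized candidate is 10% cheaper per attempt
# but fails more often.
baseline = cost_per_success(cost_per_attempt=1.00, success_rate=0.98)
quantized = cost_per_success(cost_per_attempt=0.90, success_rate=0.85)

print(f"baseline:  {baseline:.3f} per success")
print(f"quantized: {quantized:.3f} per success")
```

Under these assumed numbers the cheaper-per-attempt candidate costs about 1.06 per success against about 1.02 for the baseline, so the apparent saving disappears.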

## Quantization is not pruning or distillation

Pruning removes weights, neurons, attention heads, or other structure. Distillation trains a smaller student model to reproduce useful behavior from a larger teacher. Quantization changes numeric representation.

These techniques can be combined, but they change the model in different ways. A quantized 70-billion-parameter model still has approximately the same parameter count even though each stored parameter uses fewer bits.

## Quantization and fine-tuning

Low-bit base models can be used with parameter-efficient fine-tuning. QLoRA, for example, keeps a quantized frozen base model and trains added low-rank adapters. This lowers the memory required for adaptation compared with full-parameter fine-tuning.

That does not mean ordinary 4-bit weights are updated directly in every setup. The training library, quantization method, and adapter design determine which parameters are trainable.
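A quick count shows why low-rank adapters are cheap to train. A LoRA-style adapter on a weight matrix of shape `d_out × d_in` adds two small matrices of rank `r`, so it has `r × (d_in + d_out)` trainable parameters. The dimensions below are illustrative and not tied to a specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one low-rank adapter pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative layer size and rank.
d_in = d_out = 4096
rank = 16

adapter_params = lora_trainable_params(d_in, d_out, rank)
frozen_params = d_in * d_out
fraction = adapter_params / frozen_params

print(adapter_params)     # 131072
print(frozen_params)      # 16777216
print(f"{fraction:.4%}")  # 0.7812%
```

The frozen matrix can stay in its quantized form while only the adapter parameters receive gradients and optimizer state.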

For inference, adapter weights and runtime support add their own memory and operational considerations.

## How to select a quantized checkpoint

1. Establish the unquantized model's quality and performance baseline.
2. List hardware and runtime formats that are actually supported.
3. Calculate weight memory plus context and concurrency memory.
4. Select two or three quantization candidates.
5. Run the same task-level evaluation on each candidate.
6. Benchmark latency and throughput with realistic sequence lengths.
7. Measure peak memory, not only checkpoint size.
8. Include retries and failures in effective cost.
9. Preserve the exact model, method, runtime, and configuration in results.
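Once the measurements from the steps above exist, the final choice can be written down as an explicit rule. The sketch below is one possible rule, not a recommendation: keep candidates that fit the memory limit and stay within an allowed quality drop, then take the lowest cost per successful result. The record fields, thresholds, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    task_success: float      # measured on the same task-level evaluation
    peak_memory_gb: float    # measured peak, not checkpoint size
    cost_per_success: float  # includes retries and failures


def select(candidates, baseline_success, max_quality_drop, memory_limit_gb):
    """Return the cheapest candidate that fits memory and quality limits, or None."""
    eligible = [
        c for c in candidates
        if c.peak_memory_gb <= memory_limit_gb
        and baseline_success - c.task_success <= max_quality_drop
    ]
    return min(eligible, key=lambda c: c.cost_per_success, default=None)


# Placeholder results, not benchmarks of real checkpoints.
results = [
    Candidate("fp16-baseline", task_success=0.95, peak_memory_gb=42.0, cost_per_success=1.00),
    Candidate("int8-candidate", task_success=0.94, peak_memory_gb=24.0, cost_per_success=0.80),
    Candidate("4bit-candidate", task_success=0.88, peak_memory_gb=14.0, cost_per_success=0.70),
]

choice = select(results, baseline_success=0.95, max_quality_drop=0.02, memory_limit_gb=40.0)
print(choice.name if choice else "no candidate meets the limits")
```

With these placeholder figures the 4-bit candidate is cheapest but exceeds the allowed quality drop, and the baseline does not fit the memory limit, so the 8-bit candidate is selected.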

The open-model token cost calculator can convert measured throughput and hourly compute price into a unit cost. It should receive benchmarked values, not generic model-family claims.
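The underlying conversion is simple arithmetic. The function below is a simplified sketch that assumes the hardware is fully utilized at the measured throughput; it is not the calculator's own implementation, and the price and throughput are example inputs.

```python
def cost_per_million_tokens(hourly_price, tokens_per_second):
    """Unit cost from hourly compute price and measured sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price / tokens_per_hour * 1_000_000


# Example inputs: 2.00 per hour and 1,000 tokens per second, measured by benchmark.
unit_cost = cost_per_million_tokens(hourly_price=2.00, tokens_per_second=1000)
print(f"{unit_cost:.4f} per 1M tokens")
```

Idle time, retries, and failed requests raise the real figure, which is why the inputs should come from a benchmark of the deployed configuration.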

## What this article covers

- Precision determines how values are represented
- What can be quantized
- Post-training quantization and quantization-aware training
- Why quantization can reduce memory
- Why quantization may improve or hurt speed

## Use it with ByteCosts calculators

After reading the research note, open the related calculator and replace the example assumptions with your own users, requests, tokens, seats, or platform usage.

The goal is to convert the article's cost pattern into a concrete monthly run-rate, per-user margin, or break-even point your team can discuss.

## Frequently asked questions

### Does 4-bit quantization make a model four times smaller than FP16?

The ideal raw weight payload is roughly one quarter as large, but real memory includes scales, metadata, higher-precision modules, runtime buffers, KV cache, and allocator overhead. Measure the deployed process.

### Is an 8-bit model always faster than a 16-bit model?

No. Speed depends on hardware support, kernels, batch size, and bottlenecks. Lower memory use does not automatically produce higher throughput.

### Does quantization change the model's answers?

It can. Quantization approximates values and may change token probabilities or task behavior. Validate the exact checkpoint on production-representative tests.

### Can quantization reduce KV-cache memory?

Some runtimes support KV-cache quantization, but weight quantization alone does not automatically change the KV-cache format. Treat the two configurations separately.

## Related pricing pages

- [What Is VRAM for LLMs? Weights, KV Cache, Context, and Fit](https://bytecosts.com/blog/what-is-vram-for-llms/)
- [How to Calculate LLM Memory and VRAM Requirements for Inference](https://bytecosts.com/blog/how-to-calculate-llm-memory-requirements/)
- [What Is LLM Inference? Prefill, Decoding, Latency, and Cost](https://bytecosts.com/blog/what-is-llm-inference/)
- [Self-host LLM cost per 1M tokens calculator](https://bytecosts.com/tools/open-model-token-cost/)
- [AI Model Pricing: Compare LLM Token Costs](https://bytecosts.com/pricing/)
- [RAG cost calculator: query and context costs](https://bytecosts.com/use-cases/rag-cost-calculator/)

## Model this research

- [AI App Cost Calculator](https://bytecosts.com/tools/ai-cost-calculator/)
- [Scenario Studio](https://bytecosts.com/tools/scenario-studio/)
- [Provider Pricing Index](https://bytecosts.com/tools/ai-provider-pricing/)

## Cite this page

What Is LLM Quantization? Bits, Memory, Speed, and Quality. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/what-is-model-quantization/

**Sources**

- [Hugging Face Transformers: bitsandbytes quantization](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes)
- [GPTQ: Accurate post-training quantization for generative pre-trained transformers](https://arxiv.org/abs/2210.17323)
- [AWQ: Activation-aware weight quantization for LLM compression and acceleration](https://arxiv.org/abs/2306.00978)
- [QLoRA: Efficient finetuning of quantized LLMs](https://arxiv.org/abs/2305.14314)
