What Is LLM Inference? Prefill, Decoding, Latency, and Cost

Last updated 2026-06-21 · ByteCosts

Direct answer

LLM inference is the process of running a trained language model on input tokens to produce predictions or generated tokens. For autoregressive text generation, serving commonly includes queueing, prompt processing called prefill, and token-by-token decoding. Its performance and cost depend on model size, sequence lengths, batching, concurrency, hardware, and the serving runtime.

Summary

Training changes model parameters by optimizing them over data. Inference uses the parameters that already exist to calculate an output for a new input. Fine-tuning is also a training process because it updates some model weights or added parameters. Calling a model through an API or serving a downloaded checkpoint is inference.

This distinction matters financially. Training cost is driven by optimization steps, training tokens, activations, gradients, optimizer states, and hardware time. Inference cost is driven by request volume, input and output lengths, model size, serving efficiency, concurrency, and the price of the API or compute capacity.

Inference is not training

A model can be expensive to train yet economical to call, or inexpensive to download yet expensive to serve at low utilization.

The stages of a text-generation request

A production request can pass through several stages:

1. Request preparation: the application builds messages, tools, retrieved context, and parameters.
2. Queueing: the request waits for available serving capacity.
3. Tokenization: text is converted into token IDs.
4. Prefill: the model processes the input sequence and creates attention state for it.
5. Decoding: the model predicts and emits new tokens iteratively.
6. Detokenization and streaming: token IDs become text and may be sent incrementally.
7. Postprocessing: the application validates structured output, handles tool responses, stores traces, or retries.
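To make the stages above measurable, each one can be recorded as a separate timing per request. The following is a minimal sketch: the field names and the millisecond values are hypothetical and do not come from any real system or serving engine.

```python
from dataclasses import dataclass, fields


@dataclass
class StageTimings:
    """Wall-clock time in milliseconds per stage of one request (hypothetical names)."""
    preparation_ms: float
    queue_ms: float
    tokenize_ms: float
    prefill_ms: float
    decode_ms: float
    detokenize_ms: float
    postprocess_ms: float

    def total_ms(self) -> float:
        # Sum of all stages; an upper bound when stages overlap (e.g. streaming).
        return sum(getattr(self, f.name) for f in fields(self))

    def slowest_stage(self) -> str:
        # Name of the stage that contributes the most time.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Illustrative numbers only, not measurements.
example = StageTimings(5, 120, 2, 180, 1500, 3, 10)
print(example.total_ms(), example.slowest_stage())  # 1820 decode_ms
```

In this made-up example decoding dominates, but queueing is the second-largest stage, which would be invisible if only end-to-end latency were logged.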

Different providers and serving engines can combine or optimize these stages, but separating them helps diagnose performance.

Prefill and decoding behave differently

During prefill, the model processes the prompt. Longer input sequences generally increase prompt-processing work and time to first token. During decoding, the model generates output one token at a time. Longer responses therefore require more sequential generation steps.

This produces two user-visible performance questions:

- How long until the first useful token appears?
- How quickly do the remaining tokens arrive?

These questions correspond to metrics such as time to first token and inter-token latency. The latency versus throughput guide defines these measurements and explains why one tokens-per-second number is insufficient.
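Both metrics can be derived from the arrival timestamps of streamed tokens. The sketch below assumes the client records the time the request was sent and the time each token arrived; the function name and the sample timestamps are illustrative.

```python
from statistics import mean


def ttft_and_itl(request_sent_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (time to first token, mean inter-token latency), both in seconds."""
    if not token_times_s:
        raise ValueError("need at least one token arrival time")
    # Time to first token covers queueing, tokenization, and prefill.
    ttft = token_times_s[0] - request_sent_s
    # Inter-token latency is the gap between consecutive decoded tokens.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


# Illustrative timestamps: first token after 0.5 s, then one token every 0.25 s.
ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, itl)  # 0.5 0.25
```

A mean hides stalls, so a real benchmark would usually also report percentiles of the gaps.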

What determines inference cost

For hosted API inference, cost may include:

- Input tokens
- Cached-input categories
- Output tokens
- Batch or priority processing
- Feature-specific or modality-specific charges
- Failed calls and retries
- Secondary model calls in routing or fallback flows
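The token-based categories above can be combined into a rough per-request estimate. This is a sketch under stated assumptions: prices are quoted per million tokens, cached input is billed at its own rate, and every retry is billed in full. Real providers may bill differently, and all prices below are invented.

```python
def api_request_cost(
    input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    price_in_per_m: float,
    price_cached_per_m: float,
    price_out_per_m: float,
    attempts: int = 1,
) -> float:
    """Estimated cost of one logical request, counting retries.

    cached_input_tokens is the portion of input_tokens billed at the cached rate.
    Assumption: each attempt is billed in full.
    """
    uncached = input_tokens - cached_input_tokens
    per_attempt = (
        uncached * price_in_per_m
        + cached_input_tokens * price_cached_per_m
        + output_tokens * price_out_per_m
    ) / 1_000_000
    return per_attempt * attempts


# Invented prices: 2.00 input, 0.50 cached input, 8.00 output per million tokens.
once = api_request_cost(10_000, 8_000, 1_000, 2.0, 0.5, 8.0)
retried = api_request_cost(10_000, 8_000, 1_000, 2.0, 0.5, 8.0, attempts=2)
print(once, retried)
```

With these invented numbers the 1,000 output tokens cost as much as all 10,000 input tokens combined, which is why output length deserves its own line in any estimate.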

For self-hosted inference, cost may include:

- GPU or accelerator rental
- CPU, RAM, and storage
- Idle capacity
- Data transfer
- Autoscaling headroom
- Observability
- Engineering and operations
- Redundancy and regional deployment

The per-token cost of a self-hosted system depends heavily on utilization. A fast GPU that remains idle most of the month can have a poor effective unit cost. A highly utilized system can be economical but may expose users to queueing when demand peaks.
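The effect of utilization on unit cost is simple arithmetic. The sketch below assumes a flat hourly price and a measured sustained throughput; the function name and the example figures are illustrative, and the calculation ignores the non-GPU cost items listed earlier.

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float, utilization: float) -> float:
    """Effective cost per million tokens for a self-hosted server.

    utilization is the fraction of paid time, in (0, 1], spent serving tokens
    at the measured tokens_per_second rate.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Same hypothetical machine (2.00 per hour, 1,000 tokens/s), two utilization levels.
busy = cost_per_million_tokens(2.0, 1000, 0.50)
idle = cost_per_million_tokens(2.0, 1000, 0.05)
print(round(busy, 2), round(idle, 2))
```

With these made-up inputs the unit cost is about 1.11 per million tokens at 50% utilization and about 11.11 at 5%: the hardware is identical, but the effective price differs tenfold.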

Use the self-host versus API calculator to model the crossover with measured throughput and realistic utilization rather than hardware peak claims.
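As a rough illustration of what such a crossover means, the sketch below compares a flat monthly self-hosting bill with a single blended API price per million tokens. It is not the calculator's method; the 30-day month, the blended price, and all figures are simplifying assumptions.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_capacity_tokens(measured_tps: float, utilization: float) -> float:
    """Tokens one deployment can serve per month at a realistic utilization."""
    return measured_tps * SECONDS_PER_MONTH * utilization


def break_even_tokens(fixed_monthly_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume at which the API bill equals the flat self-hosting bill."""
    return fixed_monthly_cost / api_price_per_m * 1_000_000


def crossover_is_reachable(
    fixed_monthly_cost: float, api_price_per_m: float, measured_tps: float, utilization: float
) -> bool:
    # Self-hosting can only win if the break-even volume fits within real capacity.
    needed = break_even_tokens(fixed_monthly_cost, api_price_per_m)
    return needed <= monthly_capacity_tokens(measured_tps, utilization)


# Hypothetical: 1,500 per month self-hosted, API at 5.00 per million tokens, 500 tokens/s measured.
print(break_even_tokens(1500.0, 5.0))                  # 300000000.0
print(crossover_is_reachable(1500.0, 5.0, 500.0, 0.4))
print(crossover_is_reachable(1500.0, 5.0, 500.0, 0.1))
```

In this example the break-even volume is 300 million tokens per month, which fits at 40% utilization but not at 10%.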

Throughput, batching, and concurrency

Serving engines can batch work from multiple requests to use hardware more efficiently. Higher concurrency may increase total system throughput, but it can also increase queueing and per-user latency. The gain is bounded: throughput eventually saturates when compute, memory bandwidth, memory capacity, or another resource becomes the bottleneck.
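A deliberately simplified toy model shows the shape of this trade-off. It assumes system throughput grows linearly with concurrency until it hits a fixed ceiling; real servers bend more gradually and can slow individual users before saturation. All names and numbers are hypothetical.

```python
def toy_throughput(concurrency: int, single_stream_tps: float, saturated_tps: float) -> tuple[float, float]:
    """Return (system tokens/s, per-user tokens/s) under a piecewise-linear toy model."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # System throughput scales with concurrency until a resource saturates.
    system_tps = min(concurrency * single_stream_tps, saturated_tps)
    # The saturated total is shared among all concurrent users.
    return system_tps, system_tps / concurrency


for users in (10, 20, 40):
    print(users, toy_throughput(users, 50.0, 1000.0))
```

With these inputs, doubling from 10 to 20 users doubles system throughput at no per-user cost, while doubling again to 40 adds nothing to the total and halves each user's rate.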

This creates a service-level trade-off. Maximizing tokens per second across the server is not the same as minimizing latency for one user.

Benchmark with the input and output distributions of the target application. A result measured with short prompts and short outputs cannot be transferred directly to a long-document workload.

Memory used during inference

Model weights are only one part of the memory budget. Inference can also require:

- KV cache for processed tokens
- Temporary activations
- Runtime workspaces
- Tokenizer and framework overhead
- Batching buffers
- Memory fragmentation allowance
- Multiple replicas or adapters

Long context and high concurrency can make the KV cache a major constraint. Quantization reduces weight memory, but it does not automatically reduce every other category by the same factor. Read what VRAM means for LLMs and what model quantization is before selecting hardware.
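A back-of-the-envelope estimate shows why the KV cache grows so quickly. The sketch assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. Architectures with sliding windows, compressed attention state, or a quantized cache will differ, and the model shape below is hypothetical.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
    tokens_per_sequence: int,
    sequences: int,
) -> int:
    """Approximate KV cache size in bytes for concurrently held sequences."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens_per_sequence * sequences


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit cache values.
size = kv_cache_bytes(32, 8, 128, 2, tokens_per_sequence=8192, sequences=16)
print(size / 1024**3, "GiB")  # 16.0 GiB
```

Under these assumptions each token costs 128 KiB of cache, so sixteen concurrent 8,192-token sequences need 16 GiB before any weights are loaded. Quantizing the weights would not change this figure.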

Inference quality is part of the system

Performance optimization should not invalidate the output. Changes to precision, quantization method, sampling, batching, prompt truncation, or model routing can affect quality. A meaningful comparison holds the task, input distribution, output policy, and quality threshold constant.

A cheap response that fails validation and triggers a retry may cost more than a more capable first attempt. Multi-step workflows amplify this effect because one user action can produce several model calls and repeated processing stages.
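The retry effect can be quantified with a simple expectation. The sketch assumes each attempt succeeds independently with a fixed probability and is retried until it passes, so the expected number of attempts is 1 / success_rate. The costs and success rates are invented for illustration.

```python
def expected_cost_until_success(cost_per_attempt: float, success_rate: float, steps: int = 1) -> float:
    """Expected cost of a workflow in which every step is retried until it validates.

    Assumes independent attempts with a constant success_rate in (0, 1].
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Geometric distribution: expected attempts per step is 1 / success_rate.
    return steps * cost_per_attempt / success_rate


# Invented figures: a cheaper model that validates half the time vs. a pricier, more reliable one.
cheap = expected_cost_until_success(0.002, 0.50)
capable = expected_cost_until_success(0.003, 0.95)
print(cheap, capable)
```

Here the nominally cheaper model ends up costing more per successful response (0.004 versus roughly 0.0032), and the gap scales linearly with the number of steps in a workflow.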

Hosted inference versus local inference

Hosted APIs transfer most serving responsibility to a provider. They offer simple scaling and usage-based billing but expose the product to provider prices, limits, and terms.

Local or self-hosted inference gives more control over model, data path, runtime, and capacity. It also makes the operator responsible for deployment, performance, reliability, and utilization.

Neither approach is universally cheaper. The correct comparison is a workload model with the same quality, latency, availability, and volume requirements.

Frequently asked questions

What is the difference between inference and generation?

Inference is the broader process of running the trained model. Generation is the iterative production of output tokens in a generative language-model request.

What is prefill in LLM inference?

Prefill is the stage where the model processes the input prompt and prepares attention state before iterative output generation begins. Longer input sequences generally increase prefill work.

Why is LLM inference memory-intensive?

Large models require memory for weights, and serving also uses memory for KV cache, activations, workspaces, and concurrent requests. The exact mix depends on architecture and runtime.

Is inference cost the same as API price?

No. API price is one commercial way to charge for hosted inference. Self-hosted inference has compute and operating costs, while API workflows can also add retries, retrieval, tools, and other services.

Cite this page

What Is LLM Inference? Prefill, Decoding, Latency, and Cost. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/what-is-llm-inference/
