LLM Latency vs Throughput: TTFT, TPOT, TPS, and RPS Explained

Last updated 2026-06-21 · ByteCosts

Direct answer

LLM latency measures the time required to return part or all of a response, while throughput measures completed work per unit of time. Low latency improves an individual request; high throughput improves system capacity, and optimizing one can worsen the other. A production benchmark should report both under realistic prompt lengths, output lengths, concurrency, and quality constraints.

Latency is not one number

A streamed language-model response has several useful latency measurements.

Time to first token (TTFT) is the time from sending the request until the first content token arrives. It can include network time, queueing, tokenization, prompt processing, and initial generation.

Time per output token (TPOT), also called inter-token latency (ITL), describes the average delay between generated tokens after the first token.

End-to-end latency is the time from request submission until the final response token arrives.

end-to-end latency ≈ TTFT + generation time

For a response with more than one output token, a simplified average is:

TPOT ≈ (end-to-end latency − TTFT) ÷ (output tokens − 1)

Benchmark tools can define boundaries differently, so compare metric definitions before comparing values.
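As a minimal sketch of the definitions above, the following Python function derives TTFT, end-to-end latency, and the simplified average TPOT from client-side timestamps. The function name, the field names, and the sample timestamps are illustrative. The sketch also assumes one recorded arrival per output token; real streaming APIs may deliver several tokens per chunk, so count tokens with the tokenizer when precision matters.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StreamLatency:
    ttft_s: float            # time to first token, seconds
    e2e_s: float             # end-to-end latency, seconds
    tpot_s: Optional[float]  # None when fewer than two output tokens


def stream_latency(request_sent_s: float,
                   token_arrivals_s: Sequence[float]) -> StreamLatency:
    """Derive latency metrics for one streamed response.

    Assumption: one timestamp per output token, in arrival order,
    taken from the same clock as request_sent_s.
    """
    if not token_arrivals_s:
        raise ValueError("no tokens received")
    ttft = token_arrivals_s[0] - request_sent_s
    e2e = token_arrivals_s[-1] - request_sent_s
    n = len(token_arrivals_s)
    # TPOT ≈ (end-to-end latency − TTFT) ÷ (output tokens − 1)
    tpot = (e2e - ttft) / (n - 1) if n > 1 else None
    return StreamLatency(ttft, e2e, tpot)


# Hypothetical arrivals: first token at 0.5 s, then one token every 0.25 s.
metrics = stream_latency(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

With these made-up timestamps the result is a TTFT of 0.5 s, an end-to-end latency of 1.5 s, and a TPOT of 0.25 s. Timestamps taken on the server instead of the client would exclude network time, which is one way two tools can report different values for the same request.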

Throughput measures system capacity

Tokens per second (TPS): the number of generated tokens completed per second across a system or, in some reports, for one request.

Requests per second (RPS): the number of completed requests per second.

Input-token throughput: prompt tokens processed per second during prefill.

Output-token throughput: generated tokens produced per second during decoding.

A server can report high aggregate TPS while each user experiences slower token delivery because many requests share the same capacity. Always distinguish system-wide throughput from per-user generation speed.
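The gap between system-wide and per-user numbers can be illustrated with a short Python sketch. Both helper functions and every figure in the example (32 concurrent requests, 256 output tokens each, a 16-second window) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def aggregate_output_tps(total_output_tokens: int, window_s: float) -> float:
    """System-wide output tokens per second over a measurement window."""
    return total_output_tokens / window_s


def per_request_tps(output_tokens: int, ttft_s: float, e2e_s: float) -> float:
    """Decode speed seen by one request: tokens after the first,
    divided by the time spent generating them."""
    return (output_tokens - 1) / (e2e_s - ttft_s)


# Hypothetical run: 32 concurrent requests, each producing 256 tokens,
# each with a 1 s TTFT and a 16 s end-to-end latency.
concurrent_requests = 32
tokens_each = 256

system_tps = aggregate_output_tps(concurrent_requests * tokens_each, window_s=16.0)
user_tps = per_request_tps(tokens_each, ttft_s=1.0, e2e_s=16.0)

print(f"aggregate: {system_tps:.0f} tok/s, per request: {user_tps:.0f} tok/s")
```

In this invented scenario the server produces 512 tokens per second in aggregate while each user sees only 17 tokens per second, which is why a report needs to state which of the two it means.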

Why latency and throughput trade off

Batching multiple requests can improve hardware utilization and aggregate throughput. It can also make requests wait for a batch or compete for memory and compute. As concurrency rises, total throughput may increase until the system saturates, while queueing and individual latency continue to rise.

This trade-off is workload-dependent. Interactive chat usually values low TTFT and steady token delivery. Offline document processing may tolerate longer latency in exchange for higher batch throughput and lower unit cost.

A benchmark should therefore begin with a service objective, not a single leaderboard number.

Prompt length affects TTFT

During LLM inference, the prefill stage processes the input sequence. Longer prompts generally require more prefill work and more memory. They can increase TTFT even when output length remains unchanged.

Compare systems with the same input-token distribution. A benchmark using 128 input tokens does not predict the experience of a RAG application that sends 20,000 tokens.

Prompt caching can reduce repeated-prefix work when the provider or serving engine supports it. It does not eliminate queueing, network latency, or all prompt processing.

Output length affects end-to-end latency

Autoregressive generation emits output sequentially. More output tokens require more decoding steps. Two systems can have the same TTFT but very different end-to-end latency when one produces tokens more slowly.

Maximum-output settings also affect capacity planning. Even if average responses are short, concurrent requests that generate near the limit can occupy memory and serving slots longer than expected.

Hold output policy constant when measuring model or hardware performance.

A benchmark matrix that answers real questions

Test several combinations rather than one prompt:

Dimension	Example test points
Input length	short, median, 95th percentile
Output length	short answer, normal response, upper tail
Concurrency	1, expected steady state, peak
Request rate	below capacity, near saturation
Streaming	enabled and disabled if both matter
Quality	fixed task and acceptance threshold
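One way to turn the table into a concrete test plan is to enumerate the combinations in code. The sketch below is only an illustration: the dimension names mirror the table, but every numeric value is a placeholder to be replaced with percentiles from your own traffic, and request rate and quality are left out to keep the grid small.

```python
from itertools import product

# Placeholder values; substitute measurements from your own workload.
input_tokens = {"short": 128, "median": 2_000, "p95": 20_000}
output_tokens = {"short": 32, "normal": 256, "upper_tail": 1_024}
concurrency = {"single": 1, "steady": 16, "peak": 64}
streaming = [True, False]

# Iterating a dict yields its keys, so product() combines the labels.
matrix = [
    {
        "input_tokens": input_tokens[i],
        "output_tokens": output_tokens[o],
        "concurrency": concurrency[c],
        "streaming": s,
    }
    for i, o, c, s in product(input_tokens, output_tokens, concurrency, streaming)
]

print(len(matrix))  # 3 * 3 * 3 * 2 = 54 test points
print(matrix[0])
```

A full grid grows quickly, so in practice it may be reasonable to run every combination only for the dimensions that matter most to the service objective and sample the rest.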

For each point, report distributions, not only averages:

TTFT p50, p90, and p95
TPOT or ITL p50 and p95
End-to-end latency p50 and p95
Aggregate output TPS
Completed RPS
Error and timeout rate
Input and output token counts
Hardware and software configuration

Percentiles reveal queueing and tail behavior hidden by means.
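A small sketch shows how a single slow request can distort the mean while leaving the median untouched. It uses the nearest-rank percentile definition, which is only one of several in common use, and the ten TTFT samples (in milliseconds) are invented for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds; one request was queued.
ttft_ms = [210, 190, 250, 220, 200, 230, 240, 180, 260, 1900]

print("mean:", mean(ttft_ms))
for p in (50, 90, 95):
    print(f"p{p}:", percentile(ttft_ms, p))
```

Here the mean is 388 ms even though nine of the ten requests finished in 260 ms or less; p50 is 220 ms and p95 is 1900 ms. Benchmark tools may interpolate between samples instead of using nearest rank, so percentile values from different tools are comparable only when the method is known.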

Cost per token depends on acceptable latency

For rented compute, a simplified unit-cost equation is:

cost per 1M output tokens = (hourly compute cost ÷ output tokens per hour) × 1,000,000

But “output tokens per hour” must be measured under the latency and quality constraints of the product. Driving the server to maximum throughput may violate interactive response targets. Reserving headroom improves latency and reliability but raises effective unit cost.

The open-model token cost calculator and self-host versus API calculator require measured throughput for this reason.
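The unit-cost equation in this section can be sketched in a few lines of Python. The hourly price and both throughput figures are hypothetical; the point is only to show how the same hardware yields a different cost per token once throughput is measured under a latency target instead of at saturation.

```python
def cost_per_million_output_tokens(hourly_cost_usd: float,
                                   output_tps: float) -> float:
    """(hourly compute cost ÷ output tokens per hour) × 1,000,000."""
    if output_tps <= 0:
        raise ValueError("output_tps must be positive")
    tokens_per_hour = output_tps * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures for one rented server.
hourly_cost_usd = 3.60
peak_tps = 2000.0        # measured at saturation, latency targets ignored
slo_tps = 1000.0         # measured while meeting the product's latency targets

peak_cost = cost_per_million_output_tokens(hourly_cost_usd, peak_tps)
slo_cost = cost_per_million_output_tokens(hourly_cost_usd, slo_tps)

print(f"at peak throughput:   ${peak_cost:.2f} per 1M output tokens")
print(f"within latency target: ${slo_cost:.2f} per 1M output tokens")
```

With these invented numbers the cost is about $0.50 per 1M output tokens at peak and about $1.00 within the latency target, so halving usable throughput doubles the effective unit cost.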

Streaming changes perceived latency, not total work

Streaming lets the user see partial output before the response is complete. It can improve perceived responsiveness because reading begins while generation continues. It does not necessarily reduce end-to-end latency or compute use.

A streamed application should monitor both TTFT and completion time. Optimizing only completion time can leave users staring at an empty interface. Optimizing only TTFT can produce a quick first token followed by an unacceptably slow stream.

Common benchmark mistakes

Reporting “tokens per second” without scope. State whether it is per request, per user, or aggregate system throughput.

Comparing different sequence lengths. Input and output length materially affect performance.

Ignoring concurrency. Single-user speed does not establish multi-user capacity.

Omitting queueing. Server-only measurements can look better than end-user latency.

Using peak throughput as deployable throughput. Production systems need headroom for bursts, failures, and variance.

Ignoring quality changes. A faster quantized model or shorter response is not equivalent unless it meets the same acceptance criteria.

What this article covers

Latency is not one number
Throughput measures system capacity
Why latency and throughput trade off
Prompt length affects TTFT
Output length affects end-to-end latency

Use it with ByteCosts calculators

After reading the research note, open the related calculator and replace the example assumptions with your own users, requests, tokens, seats, or platform usage.

The goal is to convert the article's cost pattern into a concrete monthly run-rate, per-user margin, or break-even point your team can discuss.

Frequently asked questions

Is tokens per second a latency or throughput metric?

It can describe either per-user generation speed or aggregate system throughput, depending on the report. The benchmark must state the scope and calculation.

What is a good time to first token?

There is no universal threshold. The acceptable TTFT depends on whether the workload is interactive, asynchronous, batch, or agentic, and on the product's user-experience target.

Why does throughput increase while user latency gets worse?

Higher concurrency and batching can keep hardware busier and raise aggregate output, while each request waits longer or receives a smaller share of capacity.

Should I optimize TTFT or TPOT first?

Let the product experience decide. A chatbot needs a responsive first token and a readable streaming speed, so both TTFT and TPOT matter. A batch pipeline may prioritize total completion throughput and unit cost over either.

Cite this page

LLM Latency vs Throughput: TTFT, TPOT, TPS, and RPS Explained. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/llm-latency-vs-throughput/
