
Prompt Bloat Tax Calculator

Prompt bloat tax is the recurring input-token cost of system prompts, tool schemas, memory, and retrieved context that travel with each request. In the "Agent with 40 tool schemas" example, prompt overhead currently costs $12,900/month and accounts for 98.6% of input tokens before trimming. Trimming plus caching cuts monthly input cost to $4,793, a saving of $8,287, with an 8,400-token cacheable prefix and a cache break-even reuse of 0.28x.


Why this matters now

Recent agentic-system research calls out token consumption as a first-order systems cost, not just a model-quality detail.

OpenAI publishes prompt caching prices separately from standard input prices, which is why repeated prefix tokens need their own line item.

Example scenario

Worked example: the first preset, Agent with 40 tool schemas, runs with prompt caching enabled and the default anthropic:claude-sonnet-4.6 model. The current prompt overhead is $12,900/month and 98.6% of input tokens before trimming. After the preset trim plan and prompt caching, monthly input cost is $4,793, savings are $8,287, cacheable prefix is 8,400 tokens, and cache break-even reuse is 0.28x.

What the inputs mean

Prompt sections: system text, schemas, memory, history, and retrieval context.
Traffic: requests per month that carry those sections.
Optimization: tokens removed, cached, or sent only when needed.

What the result means

You get monthly bloat cost, optimized cost, savings, and the prompt section creating the largest input burden.

Assumptions

Only billable input tokens are counted in the bloat tax; output cost is shown separately.
Caching is applied only to selected stable prefix sections after trimming.
Gemini explicit-cache storage is added only when a committed storage row matches the selected model; otherwise storage cost is zero.
Quality checks are required before removing safety, policy, or tool instructions.
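
The caching and storage assumptions above can be sketched in Python. This is a minimal illustration, not the calculator's source: the function names, the segment dictionary shape, and the token counts and storage rate in the usage lines are hypothetical. The prefix rule, the storage formula, and the 720 hours/month default are taken from the Formula and methodology section.

```python
from typing import Dict, Optional, Set

HOURS_PER_MONTH = 720  # UI default stated in the methodology


def cacheable_prefix(trimmed: Dict[str, float], cacheable: Set[str]) -> float:
    """Tokens eligible for caching: trimmed segments flagged cacheable,
    capped at the total trimmed overhead."""
    trimmed_overhead = sum(trimmed.values())
    flagged = sum(tokens for name, tokens in trimmed.items() if name in cacheable)
    return min(trimmed_overhead, flagged)


def storage_cost_monthly(
    prefix_tokens: float,
    storage_per_mtok_hour: Optional[float],
    enabled: bool = True,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Explicit-cache storage cost; zero when storage is disabled or the
    rate is missing or non-positive."""
    if not enabled or storage_per_mtok_hour is None or storage_per_mtok_hour <= 0:
        return 0.0
    return prefix_tokens / 1e6 * storage_per_mtok_hour * hours


# Hypothetical trimmed segment sizes (tokens) and storage rate (USD per MTok-hour).
trimmed = {"system": 1000.0, "tools": 5000.0, "memory": 400.0, "history": 2000.0}
prefix = cacheable_prefix(trimmed, cacheable={"system", "tools"})
print(prefix)
print(storage_cost_monthly(prefix, storage_per_mtok_hour=1.0))
print(storage_cost_monthly(prefix, storage_per_mtok_hour=None))
```

Passing None for the rate stands in for the case where no committed storage row matches the selected model.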

Where the prices come from

This worked example uses the committed anthropic:claude-sonnet-4.6 pricing row and its cache fields. Pricing and cache rows preserve official source URLs, last-checked timestamps, and confidence grades in committed data; this page does not refresh data.

Formula and methodology

30-day convention: all volumes are requests/month.
Input clamping: all request, token, segment, and hour inputs are clamped to finite non-negative numbers; trim percentages are clamped to 0..100; cache hit rate is clamped to 0..1.
Token totals: overheadTokens = system + tools + memory + history; payloadTokens = userMsg.
Bloat share: bloatSharePct = overhead / (overhead + payload), or 0 when the denominator is 0.
Input cost: monthlyInputCost = requests x (overhead + payload) x inputRate / 1e6.
Output cost: requests x outputTokens x outputRate / 1e6, shown separately because the bloat tax focuses on input.
Bloat tax: bloatTaxMonthly = requests x overhead x inputRate / 1e6.
Trimming: afterTrim scales each overhead segment by (1 - trimPct / 100).
Savings: savingsUsd = max(0, currentInputCost - scenarioMonthlyInputCost), and savingsPct uses currentInputCost as the denominator: currentInputCost > 0 ? savingsUsd / currentInputCost : 0.
Cacheable prefix: afterCache first computes prefix = min(trimmedOverhead, sum of trimmed segments whose cacheable flag is true). If cache is disabled or prefix is 0, cached totals remain the trimmed totals and cache line items are zero.
Cache-aware cost: if cache is enabled, cacheAwareMonthlyCost prices prefix as systemPromptTokens, reusedContextTokens as 0, freshInputTokens as trimmedOverhead + payload - prefix, outputTokens as entered, hit fraction at cache read rate, and miss fraction at the selected write tier, defaulting to write5m.
Cache rates: cache read rate is the model's published cacheRead when present, or 0.1 x effective input rate as a fallback; cache write rate uses published write pricing when present, otherwise 1.25 x input for write5m or 2.0 x input for write1h.
Storage: Gemini explicit-cache storage is included only when a committed storage row matches providerId and the model id after the colon in recordId, lowercased. Storage uses prefix / 1e6 x storagePerMTokHour x hours/month, with the UI default hours/month constant set to 720 and the source cachedTokens equal to the computed prefix. If no storage row matches, if storage is disabled, or if the storage rate is missing or non-positive, storage cost is zero.
Engine metadata: fields such as warnings, token budgets, long-context notices, pricing policy notes, and effective rates are display-only details from the shared pricing engine; they do not change the reproducible prompt-bloat core beyond the effective rates returned by monthlyCost/cacheAwareMonthlyCost.
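
The non-cache part of the methodology can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the calculator's actual code: the names bloat_report, clamp, and input_cost are hypothetical, non-finite inputs are assumed to clamp to the lower bound, and the workload numbers in the usage lines are invented. Shares are returned as fractions; multiply by 100 for a percentage.

```python
import math
from typing import Dict

SEGMENTS = ("system", "tools", "memory", "history")


def clamp(value: float, low: float = 0.0, high: float = math.inf) -> float:
    """Clamp to [low, high]; mapping NaN/inf to `low` is an assumption."""
    if not math.isfinite(value):
        return low
    return max(low, min(high, value))


def input_cost(requests: float, tokens: float, rate_per_mtok: float) -> float:
    """requests x tokens x rate / 1e6, with the rate in USD per million tokens."""
    return requests * tokens * rate_per_mtok / 1e6


def bloat_report(
    segments: Dict[str, float],
    trim_pct: Dict[str, float],
    user_msg_tokens: float,
    requests: float,
    input_rate: float,
) -> Dict[str, float]:
    seg = {name: clamp(segments.get(name, 0.0)) for name in SEGMENTS}
    overhead = sum(seg.values())
    payload = clamp(user_msg_tokens)
    requests = clamp(requests)
    total = overhead + payload

    # afterTrim: scale each overhead segment by (1 - trimPct / 100).
    trimmed_overhead = sum(
        tokens * (1 - clamp(trim_pct.get(name, 0.0), 0.0, 100.0) / 100)
        for name, tokens in seg.items()
    )
    current = input_cost(requests, total, input_rate)
    scenario = input_cost(requests, trimmed_overhead + payload, input_rate)
    savings = max(0.0, current - scenario)
    return {
        "bloat_share": overhead / total if total > 0 else 0.0,
        "monthly_input_cost": current,
        "bloat_tax_monthly": input_cost(requests, overhead, input_rate),
        "after_trim_input_cost": scenario,
        "savings_usd": savings,
        "savings_share": savings / current if current > 0 else 0.0,
    }


# Hypothetical workload: 1M requests/month at 3.00 USD per million input tokens.
report = bloat_report(
    segments={"system": 1000, "tools": 6000, "memory": 500, "history": 2500},
    trim_pct={"tools": 50},
    user_msg_tokens=2500,
    requests=1_000_000,
    input_rate=3.0,
)
for key, value in report.items():
    print(f"{key}: {value:,.2f}")
```

In this invented workload, halving the tool schemas removes 3,000 of 12,500 input tokens per request, so the savings share is 24% of the current input cost.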

Interpretation guide

Compare alternatives with the same workload assumptions.
Stress-test output-heavy, retry-heavy, cache-miss, and power-user cases before committing budget.
Verify source links and production logs before using the estimate for billing decisions.

Limitations before production billing decisions

Treat ByteCosts calculations as planning estimates, not final billing totals. Real invoices can differ because token mix, retry rate, cache hit rate, rate limits, taxes, gateway fees, regional pricing, and negotiated discounts change the effective cost.

Verify the provider source before production billing decisions, then compare the estimate with your own logs or invoice once production traffic is live.

Frequently asked questions

What is prompt bloat?

Prompt bloat is repeated input text such as system instructions, tool schemas, memory, chat history, and retrieval context that increases cost on every request.

Should I remove all repeated context?

No. Some repeated context is required for quality and safety. The calculator is meant to identify expensive sections so you can trim, route, or cache them carefully.

When does prompt caching help?

Caching helps when a stable prefix is reused enough times for cheap reads to outweigh the initial cache write cost.
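
The page does not spell out how break-even reuse is computed, so the following Python sketch is one plausible reading, labeled as an assumption: the number of cache reads per cache write at which the cached and uncached costs of the prefix are equal. The function names are hypothetical and the input rate is an invented figure; the fallback multipliers (0.1 x input for reads, 1.25 x or 2.0 x input for writes) come from the Formula and methodology section. With those fallbacks the result is about 0.28, which agrees with the 0.28x figure in the worked example.

```python
import math
from typing import Optional, Tuple


def cache_rates(
    input_rate: float,
    cache_read: Optional[float] = None,
    cache_write: Optional[float] = None,
    tier: str = "write5m",
) -> Tuple[float, float]:
    """Published cache rates when given, otherwise the documented fallbacks."""
    read = cache_read if cache_read is not None else 0.1 * input_rate
    if cache_write is not None:
        write = cache_write
    else:
        write = (1.25 if tier == "write5m" else 2.0) * input_rate
    return read, write


def break_even_reuse(input_rate: float, read_rate: float, write_rate: float) -> float:
    """Reads per cache write at which cached and uncached cost match.

    Uncached cost of the first request plus n reuses: (1 + n) * input_rate.
    Cached cost of the same traffic: write_rate + n * read_rate.
    """
    saved_per_read = input_rate - read_rate
    if saved_per_read <= 0:
        return math.inf  # reads are no cheaper than plain input, so caching never pays
    return max(0.0, (write_rate - input_rate) / saved_per_read)


input_rate = 3.0  # hypothetical USD per million input tokens
read, write = cache_rates(input_rate)
print(round(break_even_reuse(input_rate, read, write), 2))  # 0.28

read_1h, write_1h = cache_rates(input_rate, tier="write1h")
print(round(break_even_reuse(input_rate, read_1h, write_1h), 2))  # 1.11
```

Under the fallback multipliers the ratio does not depend on the input rate, because the rate cancels out of the numerator and denominator.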

Prompt Bloat Tax Calculator. ByteCosts. https://bytecosts.com/tools/prompt-bloat-tax/
