AI Hardware vs API Payback Calculator

Direct answer

Hardware-versus-API payback must keep input tokens, output tokens, input-prefill throughput, and output-decode throughput in separate units. In the supplied viral snapshot, $20,000 buys about 31.91B cached input tokens and 2.66B output tokens (34.57B total) at a 12:1 input-to-output ratio. At 20 output tokens per second, generating the output alone takes at least 4.21 years. Dividing all 34.57B tokens by the output-only speed gives about 54.78 years, but that calculation is dimensionally invalid because input tokens are not processed at the decode rate. A 5.5-year end-to-end claim would additionally require about 786 input-prefill tokens per second under a simplified serial prefill-plus-decode model.

Why this matters now

Hardware payback posts often combine cached API pricing with an output-only throughput figure, which can make a plausible-looking comparison numerically inconsistent.

OpenRouter lists input, cached-input, and output rates separately, so the workload ratio and cache-hit share must remain visible in the calculation.

Example scenario

The viral screenshot uses a $20,000 hardware budget, $1.40 per 1M input tokens, $0.26 per 1M cached input tokens, $4.40 per 1M output tokens, a 12:1 input-to-output ratio, full input caching, and 20 output tokens per second. Those inputs produce $7.52 per 1M output-equivalent tokens, 2.66B output tokens, 31.91B input tokens, and 34.57B total tokens. The decode-only minimum is 4.21 years, not 54.78 years. The end-to-end runtime remains unverified until input-prefill throughput is supplied.
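The scenario arithmetic can be reproduced with a short script. This is a minimal sketch, not the calculator's own code: the function and variable names are illustrative, and the 365.25-day year is an assumption chosen because it reproduces the rounded figures quoted above.

```python
# Assumption: a 365.25-day year; the page does not state its secondsPerYear value.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def tokens_for_budget(budget_usd, input_rate, cached_rate, output_rate,
                      input_per_output, cache_hit_rate):
    """Return (cost per 1M output-equivalent tokens, output tokens, input tokens).

    All rates are in USD per 1M tokens.
    """
    cache_weighted_input = (input_rate * (1 - cache_hit_rate)
                            + cached_rate * cache_hit_rate)
    cost_per_1m_output_equiv = output_rate + input_per_output * cache_weighted_input
    output_tokens = budget_usd / cost_per_1m_output_equiv * 1_000_000
    input_tokens = output_tokens * input_per_output
    return cost_per_1m_output_equiv, output_tokens, input_tokens


# Viral-snapshot inputs: $20,000, 12:1 ratio, full caching, 20 output tokens/s.
cost, out_tok, in_tok = tokens_for_budget(20_000, 1.40, 0.26, 4.40, 12, 1.0)

# Valid: output tokens divided by an output-decode rate.
decode_only_years = out_tok / 20 / SECONDS_PER_YEAR

# Invalid on purpose: input + output tokens divided by an output-only rate.
mixed_unit_years = (in_tok + out_tok) / 20 / SECONDS_PER_YEAR

print(f"cost per 1M output-equivalent: ${cost:.2f}")
print(f"output tokens: {out_tok / 1e9:.2f}B, input tokens: {in_tok / 1e9:.2f}B")
print(f"decode-only lower bound: {decode_only_years:.2f} years")
print(f"mixed-unit figure (not a runtime): {mixed_unit_years:.2f} years")
```

Keeping the two divisions side by side makes the gap visible: only the first one divides a token count by a rate measured on the same kind of token.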

What the inputs mean

Hardware capital cost: purchase price plus setup cost minus expected residual value.
API token rates: standard input, cached input, and output prices per 1M tokens.
Token mix: input tokens per output token and the share of input served from cache.
Throughput: measured input-prefill tokens per second and output-decode tokens per second for the exact deployment.
Ownership costs: wall power, electricity, operations, utilization, availability, and scheduled runtime.

What the result means

The calculator returns the API token volume purchasable for the hardware budget, a decode-only lower-bound runtime, an end-to-end runtime when prefill throughput exists, the prefill speed required to satisfy a claimed payback period, annual avoided API spend, annual operating cost, break-even token volume, and simple payback years. Missing throughput stays unknown instead of being fabricated.

Assumptions

The viral preset transcribes the supplied screenshot and is labeled as a snapshot, not a current provider quote.
The current GLM 5.2 preset uses OpenRouter model API rates checked on June 21, 2026.
The 20 output tokens per second value is a claim input, not a provenance-backed benchmark.
Serial prefill-plus-decode runtime is a transparent simplification; production overlap, batching, queueing, and concurrency require an exact end-to-end benchmark.
A zero entered for electricity, operations, setup, or residual value means that line item is excluded from the calculation, not estimated as free.

Where the prices come from

The current preset uses OpenRouter's published GLM 5.2 prompt, cached-input, and completion rates from its model API. The viral preset preserves the prices visible in the user-supplied screenshot so the claim can be reproduced. The browser makes no provider API calls; all values are committed or entered by the user.

Formula and methodology

Token volume for the budget:
cacheWeightedInputRate = inputRate × (1 - cacheHitRate) + cachedInputRate × cacheHitRate
apiCostPer1MOutputEquivalent = outputRate + inputTokensPerOutputToken × cacheWeightedInputRate
outputTokensForBudget = hardwareBudget / apiCostPer1MOutputEquivalent × 1,000,000
inputTokensForBudget = outputTokensForBudget × inputTokensPerOutputToken

Runtime:
Decode-only years = outputTokensForBudget / outputDecodeTokensPerSecond / secondsPerYear
The intentionally displayed mixed-unit error is totalTokensForBudget / outputDecodeTokensPerSecond / secondsPerYear; it is not used in payback.
When prefill throughput is supplied, serialRuntimeSeconds = inputTokens / inputPrefillTokensPerSecond + outputTokens / outputDecodeTokensPerSecond

Payback:
Annual useful capacity applies scheduled hours, utilization, and availability.
Annual net savings = avoided API spend - electricity - operations
Simple payback years = net capital cost / annual net savings, when annual net savings is positive.
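The runtime and payback steps can be sketched as follows. This is an illustration under stated assumptions rather than the calculator's implementation: all function names are hypothetical, the example numbers are made up, and multiplying scheduled hours by utilization and availability is one plausible reading of "annual useful capacity" that the page does not spell out.

```python
from typing import Optional


def serial_runtime_seconds(input_tokens: float, output_tokens: float,
                           prefill_tps: Optional[float],
                           decode_tps: float) -> Optional[float]:
    """Serial prefill-plus-decode runtime; None when prefill throughput is unknown."""
    if prefill_tps is None:
        return None  # stay unknown instead of inventing a throughput
    return input_tokens / prefill_tps + output_tokens / decode_tps


def annual_useful_seconds(scheduled_hours_per_year: float,
                          utilization: float, availability: float) -> float:
    """Assumption: the three capacity factors simply multiply."""
    return scheduled_hours_per_year * 3600 * utilization * availability


def simple_payback_years(net_capital_cost: float, avoided_api_spend: float,
                         electricity: float, operations: float) -> Optional[float]:
    """Net capital cost over annual net savings; None when savings are not positive."""
    annual_net_savings = avoided_api_spend - electricity - operations
    if annual_net_savings <= 0:
        return None
    return net_capital_cost / annual_net_savings


# Hypothetical inputs, for illustration only.
print(serial_runtime_seconds(1_200, 100, 600, 20))    # prefill known
print(serial_runtime_seconds(1_200, 100, None, 20))   # prefill unknown
print(annual_useful_seconds(8_000, 0.5, 0.9))
print(simple_payback_years(20_000, 6_000, 700, 300))  # positive savings
print(simple_payback_years(20_000, 500, 700, 300))    # negative savings
```

Returning None for a missing prefill rate or non-positive savings mirrors the page's rule that unknown values stay unknown rather than being replaced by a default.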

Interpretation guide

Compare alternatives with the same workload assumptions.
Stress-test output-heavy, retry-heavy, cache-miss, and power-user cases before committing budget.
Verify source links and production logs before using the estimate for billing decisions.

Limitations before production billing decisions

Treat ByteCosts calculations as planning estimates, not final billing totals. Real invoices can differ because token mix, retry rate, cache hit rate, rate limits, taxes, gateway fees, regional pricing, and negotiated discounts change the effective cost.

Verify the provider source before production billing decisions, then compare the estimate with your own logs or invoice once production traffic is live.

Frequently asked questions

Why is dividing total tokens by output tokens per second wrong?

Output tokens per second measures generation throughput. Input tokens are processed during prefill and need an input-prefill throughput or an end-to-end benchmark. Adding input and output token counts and dividing the sum by an output-only rate mixes units.

Does 20 output tokens per second prove the 5.5-year claim?

No. It proves only a decode-only lower bound. For the viral snapshot, output generation alone takes about 4.21 years. Reaching 5.5 years end to end also requires roughly 786 input-prefill tokens per second under the calculator's serial model.
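The required prefill speed follows from rearranging the serial model: subtract the decode time from the claimed runtime, then divide the input tokens by what remains. The sketch below assumes a 365.25-day year and uses a hypothetical function name; it is not the calculator's own code.

```python
from typing import Optional

# Assumption: a 365.25-day year, which matches the rounded figures on this page.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def required_prefill_tps(input_tokens: float, output_tokens: float,
                         decode_tps: float, claimed_years: float) -> Optional[float]:
    """Prefill tokens/s needed for a serial prefill-plus-decode run to fit the claim.

    Returns None when decode alone already takes longer than the claimed period.
    """
    decode_seconds = output_tokens / decode_tps
    remaining_seconds = claimed_years * SECONDS_PER_YEAR - decode_seconds
    if remaining_seconds <= 0:
        return None
    return input_tokens / remaining_seconds


# Viral-snapshot volumes: $20,000 at $7.52 per 1M output-equivalent tokens, 12:1 mix.
output_tokens = 20_000 / 7.52 * 1_000_000
input_tokens = 12 * output_tokens

needed = required_prefill_tps(input_tokens, output_tokens, 20, 5.5)
print(f"required prefill speed for 5.5 years: {needed:.0f} tokens/s")

# A claim shorter than the 4.21-year decode-only bound cannot be met at any prefill speed.
print(required_prefill_tps(input_tokens, output_tokens, 20, 4.0))
```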

What costs belong in a real hardware payback calculation?

Include purchase and setup cost, residual value, average wall power, electricity, operations, utilization, availability, and the exact throughput of the deployment. An excluded line item should be marked as excluded rather than replaced with an invented default.

Can I use provider endpoint speed as self-host throughput?

Not safely. Provider endpoint speed may reflect a different GPU count, precision, inference engine, batch size, context length, and queueing policy. Use a benchmark for the exact model version and deployment, or leave throughput unknown.

Cite this page

AI Hardware vs API Payback Calculator. ByteCosts. https://bytecosts.com/tools/hardware-vs-api-payback/
