# AI Hardware vs API Payback Calculator

> Canonical: https://bytecosts.com/tools/hardware-vs-api-payback/

**Direct answer.** A hardware-versus-API payback calculation must keep input tokens, output tokens, input-prefill throughput, and output-decode throughput in separate units. With the example numbers, $20,000 buys about 31.91B cached input tokens and 2.66B output tokens (34.57B total) at a 12:1 input-to-output ratio. At 20 output tokens per second, generating the output alone takes at least 4.21 years. Dividing all 34.57B tokens by that output-only speed gives about 54.78 years, but the calculation mixes incompatible units. Under a simplified serial prefill-plus-decode model, a 5.5-year end-to-end estimate additionally requires about 786 input-prefill tokens per second.

**[Open the live Hardware vs API Payback calculator - Payback years + runtime audit →](https://bytecosts.com/tools/hardware-vs-api-payback/)**

## Why this matters now

Hardware payback comparisons often pair cached API pricing with an output-only throughput figure. The result looks plausible but is numerically inconsistent, because the price covers input and output tokens while the speed covers output tokens only.

OpenRouter lists input, cached-input, and output rates separately, so the workload ratio and cache-hit share must remain visible in the calculation.

## Example scenario

The example uses a $20,000 hardware budget, $1.40 per 1M input tokens, $0.26 per 1M cached input tokens, $4.40 per 1M output tokens, a 12:1 input-to-output ratio, full input caching, and 20 output tokens per second. Those inputs produce $7.52 per 1M output-equivalent tokens (the price of 1M output tokens plus the 12M cached input tokens that accompany them), 2.66B output tokens, 31.91B input tokens, and 34.57B total tokens. The output-only minimum is 4.21 years. End-to-end runtime remains unknown until input-prefill throughput is supplied.
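The arithmetic in this scenario can be reproduced with a short Python sketch. The function and variable names are illustrative rather than the calculator's own, and the 365.25-day year is an assumption, chosen because it reproduces the 4.21-year figure.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length (31,557,600 s)


def tokens_for_budget(budget_usd, input_rate, cached_rate, output_rate,
                      input_per_output, cache_hit_rate):
    """Return (cost per 1M output-equivalent, output tokens, input tokens).

    All rates are USD per 1M tokens. Names are hypothetical.
    """
    # Blend standard and cached input prices by the cache-hit share.
    weighted_input_rate = (input_rate * (1 - cache_hit_rate)
                           + cached_rate * cache_hit_rate)
    # 1M output tokens plus the input tokens that accompany them.
    cost_per_1m = output_rate + input_per_output * weighted_input_rate
    output_tokens = budget_usd / cost_per_1m * 1_000_000
    input_tokens = output_tokens * input_per_output
    return cost_per_1m, output_tokens, input_tokens


cost_per_1m, out_tok, in_tok = tokens_for_budget(
    budget_usd=20_000, input_rate=1.40, cached_rate=0.26, output_rate=4.40,
    input_per_output=12, cache_hit_rate=1.0)

# Output-only lower bound at 20 output tokens per second.
output_only_years = out_tok / 20 / SECONDS_PER_YEAR

print(f"${cost_per_1m:.2f} per 1M output-equivalent tokens")
print(f"{out_tok / 1e9:.2f}B output, {in_tok / 1e9:.2f}B input, "
      f"{(out_tok + in_tok) / 1e9:.2f}B total")
print(f"output-only minimum: {output_only_years:.2f} years")
```

Lowering `cache_hit_rate` below 1.0 raises the blended input rate, so the same budget buys fewer tokens and the output-only minimum shrinks accordingly.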

## What the inputs mean

- Hardware capital cost: purchase price plus setup cost minus expected residual value.
- API token rates: standard input, cached input, and output prices per 1M tokens.
- Token mix: input tokens per output token and the share of input served from cache.
- Throughput: measured input-prefill tokens per second and output-decode tokens per second for the exact deployment.
- Ownership costs: wall power, electricity, operations, utilization, availability, and scheduled runtime.

## What the result means

The calculator returns the API token volume purchasable for the hardware budget, an output-only minimum runtime, an end-to-end runtime when prefill throughput exists, the prefill speed required for a target comparison period, annual avoided API spend, annual operating cost, break-even token volume, and simple payback years. Missing throughput stays unknown instead of being fabricated.
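One way to keep missing throughput unknown is to model it as an optional value and propagate `None` rather than substituting a default. The sketch below is an assumption about how that could look, not the calculator's source; `end_to_end_years` and the 365.25-day year are hypothetical choices.

```python
from typing import Optional

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def end_to_end_years(input_tokens: float, output_tokens: float,
                     decode_tps: float,
                     prefill_tps: Optional[float] = None) -> Optional[float]:
    """Serial prefill-plus-decode runtime in years, or None if unknown."""
    if prefill_tps is None:
        # No prefill measurement: report "unknown" instead of guessing.
        return None
    seconds = input_tokens / prefill_tps + output_tokens / decode_tps
    return seconds / SECONDS_PER_YEAR


# Token volumes from the example scenario ($20,000 at $7.52 per 1M, 12:1).
out_tok = 20_000 / 7.52 * 1_000_000
in_tok = out_tok * 12

unknown = end_to_end_years(in_tok, out_tok, decode_tps=20)
known = end_to_end_years(in_tok, out_tok, decode_tps=20, prefill_tps=786)
print(unknown)          # None
print(f"{known:.2f}")   # 5.50
```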

## Assumptions

- The example preset preserves the supplied values so the calculation can be reproduced; it is not a current provider quote.
- The GLM 5.2 pricing preset uses OpenRouter model API rates checked on June 21, 2026.
- The 20 output tokens per second value is a manual example input, not a provenance-backed benchmark.
- Serial prefill-plus-decode runtime is a transparent simplification; production overlap, batching, queueing, and concurrency require an exact end-to-end benchmark.
- A zero entered for electricity, operations, setup, or residual value means that line item is excluded from the calculation, not that it is estimated to cost nothing.

## Where the prices come from

The GLM 5.2 pricing preset uses OpenRouter's published prompt, cached-input, and completion rates. The example preset preserves the user-provided values so the arithmetic can be reproduced. The browser makes no provider API calls; all values are committed or entered by the user.

## Formula and methodology

- cacheWeightedInputRate = inputRate × (1 - cacheHitRate) + cachedInputRate × cacheHitRate
- apiCostPer1MOutputEquivalent = outputRate + inputTokensPerOutputToken × cacheWeightedInputRate
- outputTokensForBudget = hardwareBudget / apiCostPer1MOutputEquivalent × 1,000,000
- inputTokensForBudget = outputTokensForBudget × inputTokensPerOutputToken
- totalTokensForBudget = inputTokensForBudget + outputTokensForBudget
- Output-only years = outputTokensForBudget / outputDecodeTokensPerSecond / secondsPerYear
- Mixed-unit diagnostic = totalTokensForBudget / outputDecodeTokensPerSecond / secondsPerYear. It is shown only to explain why that shortcut is invalid and is not used in payback.
- serialRuntimeSeconds = inputTokensForBudget / inputPrefillTokensPerSecond + outputTokensForBudget / outputDecodeTokensPerSecond, computed only when prefill throughput is supplied
- Annual useful capacity applies scheduled hours, utilization, and availability.
- Annual net savings = avoided API spend - electricity - operations
- Simple payback years = net capital cost / annual net savings, when annual net savings is positive
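The capacity and payback steps can be sketched in Python as follows. The source does not spell out how annual useful capacity is computed, so the product `scheduled_hours × utilization × availability` and the per-token serial time are assumptions, as are all function and parameter names. The 786 prefill tokens per second is the figure from the 5.5-year comparison, not a benchmark.

```python
from typing import Optional


def annual_output_tokens(scheduled_hours, utilization, availability,
                         input_per_output, prefill_tps, decode_tps):
    """Output tokens served per year under the serial model (assumed form)."""
    useful_seconds = scheduled_hours * 3600 * utilization * availability
    # Each output token costs its own decode time plus the prefill time
    # of the input tokens that accompany it.
    seconds_per_output_token = input_per_output / prefill_tps + 1 / decode_tps
    return useful_seconds / seconds_per_output_token


def simple_payback_years(purchase, setup, residual, avoided_api_spend,
                         electricity, operations) -> Optional[float]:
    """Net capital cost / annual net savings, or None if savings <= 0."""
    net_capital = purchase + setup - residual
    net_savings = avoided_api_spend - electricity - operations
    if net_savings <= 0:
        return None  # hardware never pays back under these inputs
    return net_capital / net_savings


# Illustrative run: 8,766 scheduled hours (a 365.25-day year), with
# utilization and availability set to 1.0 purely to match the example.
tokens = annual_output_tokens(scheduled_hours=8766, utilization=1.0,
                              availability=1.0, input_per_output=12,
                              prefill_tps=786, decode_tps=20)
avoided = tokens / 1_000_000 * 7.52  # $7.52 per 1M output-equivalent tokens
payback = simple_payback_years(purchase=20_000, setup=0, residual=0,
                               avoided_api_spend=avoided,
                               electricity=0, operations=0)
print(f"avoided API spend: ${avoided:,.0f} per year")
print(f"simple payback: {payback:.2f} years")
```

With electricity and operations excluded and the machine running continuously, payback lands at roughly 5.5 years, the same as the serial runtime for the full budget. Any nonzero operating cost or utilization below 1.0 lengthens it.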

## Interpretation guide

- Compare alternatives with the same workload assumptions.
- Stress-test output-heavy, retry-heavy, cache-miss, and power-user cases before committing budget.
- Verify source links and production logs before using the estimate for billing decisions.

## Limitations before production billing decisions

Treat ByteCosts calculations as planning estimates, not final billing totals. Real invoices can differ because token mix, retry rate, cache hit rate, rate limits, taxes, gateway fees, regional pricing, and negotiated discounts change the effective cost.

Verify the provider source before production billing decisions, then compare the estimate with your own logs or invoice once production traffic is live.

## Frequently asked questions

### Why is dividing total tokens by output tokens per second wrong?

Output tokens per second measures generation throughput. Input tokens are processed during prefill and need an input-prefill throughput or an end-to-end benchmark. Adding input and output token counts and dividing the sum by an output-only rate mixes units.
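A small Python sketch makes the size of the error concrete for the example scenario. The variable names are illustrative and the 365.25-day year is an assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length

input_per_output = 12
decode_tps = 20  # output tokens per second (manual example input)

output_tokens = 20_000 / 7.52 * 1_000_000
input_tokens = output_tokens * input_per_output

# Valid: output tokens divided by an output rate.
output_only_years = output_tokens / decode_tps / SECONDS_PER_YEAR

# Invalid shortcut: input tokens are also divided by the output rate,
# as if every input token took a full decode step.
mixed_unit_years = (input_tokens + output_tokens) / decode_tps / SECONDS_PER_YEAR

inflation = mixed_unit_years / output_only_years
print(f"output-only lower bound: {output_only_years:.2f} years")
print(f"mixed-unit shortcut:     {mixed_unit_years:.2f} years")
print(f"inflation factor:        {inflation:.1f}x")
```

The shortcut inflates the output-only figure by exactly `1 + input_per_output`, which is 13 at a 12:1 ratio, regardless of how fast the hardware actually processes input.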

### Does 20 output tokens per second prove a 5.5-year end-to-end estimate?

No. It establishes only an output-generation lower bound. For the example calculation, output generation alone takes about 4.21 years. Reaching 5.5 years end to end also requires roughly 786 input-prefill tokens per second under the calculator's serial model.
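The required prefill speed follows from inverting the serial model: subtract the decode time from the target period and spread the input tokens over what remains. The Python sketch below assumes a 365.25-day year, and `required_prefill_tps` is a hypothetical helper name.

```python
from typing import Optional

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def required_prefill_tps(input_tokens: float, output_tokens: float,
                         decode_tps: float,
                         target_years: float) -> Optional[float]:
    """Prefill tokens/s needed to finish in target_years (serial model)."""
    decode_seconds = output_tokens / decode_tps
    prefill_budget = target_years * SECONDS_PER_YEAR - decode_seconds
    if prefill_budget <= 0:
        # Decode alone already exceeds the target: no prefill speed suffices.
        return None
    return input_tokens / prefill_budget


out_tok = 20_000 / 7.52 * 1_000_000
in_tok = out_tok * 12

needed = required_prefill_tps(in_tok, out_tok, decode_tps=20, target_years=5.5)
too_short = required_prefill_tps(in_tok, out_tok, decode_tps=20, target_years=4.0)
print(f"{needed:.0f} prefill tokens/s")  # 786 prefill tokens/s
print(too_short)                         # None
```

A 4.0-year target returns `None` because output generation alone already needs about 4.21 years at 20 tokens per second.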

### What costs belong in a real hardware payback calculation?

Include purchase and setup cost, residual value, average wall power, electricity, operations, utilization, availability, and the exact throughput of the deployment. When a line item is left out, mark it as excluded rather than replacing it with an invented default.

### Can I use provider endpoint speed as self-host throughput?

Not safely. Provider endpoint speed may reflect a different GPU count, precision, inference engine, batch size, context length, and queueing policy than your own deployment. Use a benchmark for the exact model version and deployment, or leave throughput unknown.

## Related ByteCosts tools

- [AI App Cost Calculator](https://bytecosts.com/tools/ai-cost-calculator/) - Estimate monthly model spend
- [Scenario Studio](https://bytecosts.com/tools/scenario-studio/) - Combine the full workload
- [Provider Pricing Index](https://bytecosts.com/tools/ai-provider-pricing/) - Verify source-backed model rates

## Cite this page

AI Hardware vs API Payback Calculator. ByteCosts. https://bytecosts.com/tools/hardware-vs-api-payback/

**Sources**

- [OpenRouter GLM 5.2 model page](https://openrouter.ai/z-ai/glm-5.2)
- [OpenRouter model API](https://openrouter.ai/api/v1/models)
- [Z.ai GLM 5.2 model weights](https://huggingface.co/zai-org/GLM-5.2)
- [ByteCosts methodology](https://bytecosts.com/methodology/)
- [ByteCosts provider pricing index](https://bytecosts.com/tools/ai-provider-pricing/)
