Self-host Qwen3.6 35B A3B: GPU, VRAM, and rental cost
Direct answer
Self-hosting Qwen3.6 35B A3B (36B total parameters, 3B active) in FP8 needs about 49 GB of VRAM per GPU at 8K context and 8 concurrent requests. Seven tracked GPUs hold it on a single card; the cheapest is the A100 PCIe 80GB, from about $1.39 per GPU-hour. Longer context or higher concurrency grows the KV cache and can push the model past one card, at which point it needs tensor parallelism. This is a fit and rental-rate estimate, not a throughput quote; use the calculators below for cost per token.
Estimate cost per 1M tokens - Self-host Qwen3.6 35B A3B serving cost →
GPUs that hold Qwen3.6 35B A3B on one card
Single-GPU fit in FP8 at 8K context, 8 concurrent requests, with the cheapest tracked rental rate.
| GPU | VRAM | Cheapest rate (per GPU-hour) | Provider |
|---|---|---|---|
| A100 PCIe 80GB | 80 GB | $1.39 | RunPod |
| A100 SXM 80GB | 80 GB | $1.49 | RunPod |
| H100 SXM 80GB | 80 GB | $1.80 | Vast.ai |
| GH200 Grace Hopper 96GB HBM3 | 96 GB | $2.29 | Lambda |
| H200 SXM 141GB | 141 GB | $2.35 | Vast.ai |
| H100 NVL 94GB | 94 GB | $3.19 | RunPod |
| B200 (HGX, per GPU) | 180 GB | $5.89 | RunPod |
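The selection behind this table can be sketched in a few lines of Python: keep the cards whose VRAM meets the requirement, then take the lowest hourly rate. The rows are copied from the table above; the `Gpu` type and the `cheapest_fit` helper are hypothetical names used only for this illustration, not part of any ByteCosts tool.

```python
from typing import NamedTuple, Optional, Sequence


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    usd_per_hour: float
    provider: str


# Rows copied from the table above (cheapest tracked rate per GPU-hour).
TRACKED = [
    Gpu("A100 PCIe 80GB", 80, 1.39, "RunPod"),
    Gpu("A100 SXM 80GB", 80, 1.49, "RunPod"),
    Gpu("H100 SXM 80GB", 80, 1.80, "Vast.ai"),
    Gpu("GH200 Grace Hopper 96GB HBM3", 96, 2.29, "Lambda"),
    Gpu("H200 SXM 141GB", 141, 2.35, "Vast.ai"),
    Gpu("H100 NVL 94GB", 94, 3.19, "RunPod"),
    Gpu("B200 (HGX, per GPU)", 180, 5.89, "RunPod"),
]


def cheapest_fit(required_gb: float,
                 gpus: Sequence[Gpu] = TRACKED) -> Optional[Gpu]:
    """Return the lowest-priced GPU with enough VRAM, or None if none fits."""
    candidates = [g for g in gpus if g.vram_gb >= required_gb]
    return min(candidates, key=lambda g: g.usd_per_hour, default=None)


if __name__ == "__main__":
    print(cheapest_fit(49))  # the 49 GB requirement quoted on this page
```

Raising `required_gb` (for example, to model a longer context) shrinks the candidate list; a result of `None` means no single tracked card fits and the deployment would need tensor parallelism.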
Frequently asked questions
What GPU do I need to run Qwen3.6 35B A3B?
In FP8 at 8K context and 8 concurrent requests, Qwen3.6 35B A3B needs about 49 GB of VRAM per GPU. The cheapest single GPU that holds it is the A100 PCIe 80GB (80 GB), from around $1.39 per GPU-hour. Longer context or higher concurrency needs more VRAM or tensor parallelism.
How is the VRAM figure calculated?
The estimate sums the model weights (parameter count times bytes per weight at the chosen precision), the KV cache (derived from the model's actual layer count, KV heads, head dimension, and attention pattern), activation memory, and a safety margin. Architecture values come from the model's Hugging Face config; GPU VRAM comes from the NVIDIA datasheet.
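As a rough illustration of that sum, the sketch below adds the three components and applies a margin. It is not the page's actual calculator. The architecture numbers in the example call (`kv_layers`, `kv_heads`, `head_dim`), the 2 GB activation allowance, and the 10% multiplicative margin are placeholder assumptions, not values from the Qwen3.6 config, so the result will not reproduce the 49 GB figure. `kv_layers` is meant to count only the layers that keep a full KV cache, which is one simple way to reflect the attention pattern.

```python
def weights_gb(params: float, bytes_per_weight: float) -> float:
    """Weight memory: parameter count times bytes per weight."""
    return params * bytes_per_weight / 1e9


def kv_cache_gb(kv_layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float, context_tokens: int,
                concurrency: int) -> float:
    """KV cache: keys and values (factor 2) for every cached layer,
    KV head, and head dimension, per token, per concurrent request."""
    per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrency / 1e9


def total_vram_gb(weights: float, kv_cache: float,
                  activation_gb: float, margin: float) -> float:
    """Sum the components, then apply a multiplicative safety margin.
    (Assumption: the source does not say how its margin is applied.)"""
    return (weights + kv_cache + activation_gb) * (1 + margin)


if __name__ == "__main__":
    w = weights_gb(36e9, 1.0)  # 36B parameters at FP8 (1 byte per weight)
    kv = kv_cache_gb(           # PLACEHOLDER architecture values below
        kv_layers=10, kv_heads=2, head_dim=256,
        bytes_per_value=2, context_tokens=8192, concurrency=8,
    )
    # PLACEHOLDER activation allowance and margin
    print(round(total_vram_gb(w, kv, activation_gb=2.0, margin=0.10), 1))
```

The KV term scales linearly with both context length and concurrency, which is why raising either one is what eventually pushes the model past a single card.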
Cite this page
Self-host Qwen3.6 35B A3B: GPU and VRAM. ByteCosts. Updated June 18, 2026. https://bytecosts.com/gpu/self-host/qwen3-6-35b-a3b/