AI & LLM Economics

GPU VRAM Calculator (Model Size & GPU Fit)

Enter a model's parameter count and the precision its weights are stored in, and this calculator estimates the GPU memory needed to load it for inference — the raw weight footprint plus a rough overhead for the KV-cache and activations — then checks that total against a reference table of common GPUs to show the smallest one it fits on. It is a back-of-the-envelope planning aid, not a guarantee that a given model will run on a given card.

Model

GPU table as of 2025 — verify current specs, pricing, and availability.

Estimated VRAM needed

16.8 GB

14 GB weights + 20% overhead

Recommended GPU

RTX 4090

Weights

14 GB

Memory & GPU-fit detail
Weights (2 B/param)14 GBOverhead (20%)2.8 GBTotal VRAM16.8 GB
GPUVRAMFits
RTX 40608 GBno
RTX 409024 GByes
A100 40GB40 GByes
A100 80GB80 GByes
H100 80GB80 GByes

GPU table as of 2025 — verify current specs/availability.

Model VRAM vs GPU capacity

ModelH100 80GB

How it works

The weight footprint is parameter count times bytes per parameter. Precision sets the bytes: fp32 stores 4 bytes per parameter, fp16 and bf16 store 2, int8 stores 1, and int4 packs two parameters into a single byte for 0.5. Because a billion parameters at one byte each is exactly one gigabyte, a model's size in billions multiplied by bytes per parameter gives the weight size in GB directly — a 7-billion-parameter model at fp16 (2 bytes) is 7 × 2 = 14 GB of weights.

Loading weights is not the whole story, so the calculator adds an overhead percentage (default 20%) on top to approximate the KV-cache and activation memory an inference run also consumes. Total memory is weights × (1 + overhead / 100): the 14 GB example becomes 14 × 1.2 = 16.8 GB. This overhead is a rough constant here; in reality it grows with context length and batch size, so long prompts or high concurrency can push it well above 20%.

That total is compared against a versioned table of GPUs — RTX 4060 (8 GB), RTX 4090 (24 GB), A100 (40 GB and 80 GB), and H100 (80 GB) — as of 2025. The tool reports which cards fit and recommends the smallest single GPU with enough VRAM. If the total exceeds the largest card in the table, it flags that you will need to split the model across multiple GPUs. Always verify current specs, pricing, and availability, as the GPU landscape changes quickly.

Frequently asked questions

Does quantization (int8, int4) come for free?+

No — it trades quality for memory. Casting weights from fp16 down to int8 halves the footprint, and int4 quarters it, which is why a model that needs a data-center GPU at fp16 can sometimes run on a consumer card when quantized. But quantization is lossy: it approximates each weight with fewer bits, and lower precision can degrade output quality, especially for smaller models or demanding tasks. Modern quantization methods keep the loss small, but you should benchmark accuracy on your own workload rather than assume int4 is a drop-in replacement for fp16.

Why is the overhead just a flat percentage?+

The 20% default is a deliberate simplification, and this is a rough estimate. Real inference overhead — mostly the KV-cache that stores attention state — scales with how many tokens are in context and how many requests run in parallel. A short single prompt might add far less than 20%, while a long-context or high-batch serving setup can add much more, sometimes exceeding the weights themselves. Treat the total here as a floor for a modest workload and add headroom for your actual context length and concurrency.

Does this cover training, or only inference?+

Only inference. Training needs far more VRAM than loading weights: on top of the parameters you must hold gradients, optimizer states (Adam keeps two extra values per parameter), and cached activations for backpropagation — often several times the model's own size. A model that infers comfortably on one GPU can require a cluster to fine-tune fully. This calculator is for estimating whether you can load and serve a model, not train it; for training, use a memory estimator built for that purpose.

Related tools

Sources