Question 1

Does quantization (int8, int4) come for free?

Accepted Answer

No. Quantization trades quality for memory. Storing weights as int8 instead of fp16 roughly halves the footprint, and int4 roughly quarters it, which is why a model that needs a data-center GPU at fp16 can sometimes run on a consumer card when quantized. But quantization is lossy: it approximates each weight with fewer bits, and the lower precision can degrade output quality, especially for smaller models or demanding tasks. Modern quantization methods keep the loss small, but you should benchmark accuracy on your own workload rather than assume int4 is a drop-in replacement for fp16.
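The memory side of that trade-off is simple arithmetic. The sketch below is illustrative only: the 7-billion-parameter model is a hypothetical example, and the function name `weight_gib` is made up for this answer. It also ignores the small extra storage that real quantized formats need for scales and zero-points.

```python
# Approximate bytes stored per weight at each precision.
# Assumption: quantization metadata (scales, zero-points) is ignored,
# so real quantized files come out slightly larger than this.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, precision: str) -> float:
    """Return the approximate weight footprint in GiB."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1024**3


if __name__ == "__main__":
    params = 7e9  # hypothetical 7-billion-parameter model
    for precision in BYTES_PER_WEIGHT:
        print(f"{precision}: {weight_gib(params, precision):.1f} GiB")
```

This covers the weights only; it says nothing about how much quality is lost, which still has to be measured on your own workload.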

Question 2

Why is the overhead just a flat percentage?

Accepted Answer

The 20% default is a deliberate simplification that gives only a rough estimate. Real inference overhead, mostly the KV cache that stores attention keys and values, scales with how many tokens are in context and how many requests run in parallel. A short single prompt might add far less than 20%, while a long-context or high-batch serving setup can add much more, sometimes exceeding the weights themselves. Treat the total here as a floor for a modest workload and add headroom for your actual context length and concurrency.
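To see how far a flat percentage can drift, here is a rough sketch of the usual KV-cache size estimate for a standard transformer. The model shape (32 layers, 32 key/value heads, head dimension 128, fp16 cache) and the 7-billion-parameter weight count are hypothetical values picked for illustration, and applying the 20% to the fp16 weight size is an assumption about how the calculator works, not something stated above.

```python
GIB = 1024**3


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, batch_size,
                   bytes_per_value=2):
    """Estimate KV-cache size in bytes for a standard transformer.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return (2 * layers * kv_heads * head_dim * bytes_per_value
            * context_tokens * batch_size)


if __name__ == "__main__":
    # Hypothetical model: 7e9 parameters stored in fp16 (2 bytes each).
    weights_bytes = 7e9 * 2
    flat_overhead = 0.20 * weights_bytes  # assumed meaning of the 20% default
    print(f"weights: {weights_bytes / GIB:.1f} GiB, "
          f"flat 20%: {flat_overhead / GIB:.1f} GiB")

    # Same hypothetical model shape, three different workloads.
    for tokens, batch in [(512, 1), (4096, 1), (4096, 8)]:
        kv = kv_cache_bytes(32, 32, 128, tokens, batch)
        print(f"context={tokens}, batch={batch}: KV cache {kv / GIB:.2f} GiB")
```

With these assumed numbers the short single prompt stays well under the flat 20%, while the long-context, batch-of-8 case needs a cache larger than the weights.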

Question 3

Does this cover training, or only inference?

Accepted Answer

Only inference. Training needs far more VRAM than loading weights: on top of the parameters you must hold gradients, optimizer states (Adam keeps two extra values per parameter), and cached activations for backpropagation, which together often come to several times the model's own size. A model that infers comfortably on one GPU can require a cluster to fine-tune fully. This calculator is for estimating whether you can load and serve a model, not train it; for training, use a memory estimator built for that purpose.
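A back-of-the-envelope sketch of the fixed part of that training cost follows. It assumes every value (weights, gradients, and both Adam states) is kept in fp32 and leaves out activations entirely, since those depend on batch size and sequence length. The function name and the 7-billion-parameter example are hypothetical.

```python
GIB = 1024**3


def training_state_bytes(num_params: float, bytes_per_value: int = 4) -> float:
    """Estimate bytes for weights, gradients, and Adam states.

    Assumption: everything is stored at the same precision (fp32 by default).
    Activations are NOT included; they come on top of this figure.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # two extra values per parameter
    return weights + gradients + adam_states


if __name__ == "__main__":
    params = 7e9  # hypothetical 7-billion-parameter model
    inference = params * 4
    training = training_state_bytes(params)
    print(f"fp32 weights only: {inference / GIB:.1f} GiB")
    print(f"weights + grads + Adam: {training / GIB:.1f} GiB "
          f"({training / inference:.0f}x the weights, before activations)")
```

Under these assumptions the fixed training state alone is four times the weight footprint, which is why a model that fits for inference can be far out of reach for full fine-tuning on the same card.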

GPU VRAM Calculator (Model Size & GPU Fit)
