Quantization, in plain words

Every local model lists cryptic tags like Q4_K_M or Q8_0. They describe how much the model has been compressed to fit in memory. Here is what they mean and which one to pick.

The short version

Use Q4_K_M. It keeps almost all of the quality at about a third of the original size, so it fits where the full model never would. Step up to Q8_0 only if you have memory to spare; drop below Q4 only when nothing else fits.

What the letters and numbers mean

A raw model stores each weight in 16 bits (FP16). Quantization stores each weight in fewer bits, which shrinks the file and the memory it needs. The tag breaks down like this:

Q means quantized.
The number (2, 3, 4, 5, 6, 8) is roughly the bits kept per weight. Fewer bits means a smaller file and lower quality.
K means a k-quant: a smarter method that spends more bits on the tensors (groups of weights) that matter most and fewer elsewhere.
S / M / L is the small, medium or large mix of that method. Medium (M) is the usual choice.

So Q4_K_M reads as: 4-bit k-quant, medium mix. The bits-per-weight figures come from the llama.cpp quantize table.
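The breakdown above can be sketched as a small parser. This is an illustrative helper, not part of llama.cpp: the names `parse_quant_tag` and `describe` are made up for this example, and the pattern only covers the tag shapes this guide mentions (such as Q4_K_M, Q6_K and Q8_0). The trailing digit in tags like Q8_0 is accepted but not interpreted, since the guide does not cover it.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: Q<bits>_K, Q<bits>_K_<S|M|L>, or Q<bits>_<digit>.
# It covers only the tag shapes discussed in this guide.
TAG_PATTERN = re.compile(r"Q(\d)_(?:(K)(?:_([SML]))?|\d)")

MIX_NAMES = {"S": "small", "M": "medium", "L": "large"}


class QuantTag(NamedTuple):
    bits: int            # nominal bits kept per weight
    k_quant: bool        # True for k-quants such as Q4_K_M
    mix: Optional[str]   # "S", "M", "L", or None when the tag has no mix


def parse_quant_tag(tag: str) -> QuantTag:
    """Split a tag such as 'Q4_K_M' into its parts."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized quant tag: {tag!r}")
    bits, k, mix = match.groups()
    return QuantTag(bits=int(bits), k_quant=k is not None, mix=mix)


def describe(tag: str) -> str:
    """Render a tag in the plain-words form used in this guide."""
    parsed = parse_quant_tag(tag)
    if not parsed.k_quant:
        return f"{parsed.bits}-bit quant"
    if parsed.mix is None:
        return f"{parsed.bits}-bit k-quant"
    return f"{parsed.bits}-bit k-quant, {MIX_NAMES[parsed.mix]} mix"


print(describe("Q4_K_M"))  # 4-bit k-quant, medium mix
```

Returning a small named tuple keeps the three parts of the tag separate, so other code can filter on bits or mix without re-parsing the string.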

A worked example: Llama 3.1 8B

Here is the same 8B model at each quant. Sizes marked measured are read from real GGUF files; the rest are computed from bits-per-weight.

Llama 3.1 8B on-disk size

Quant     Size        Source
Q2_K      ~3.4 GB     estimated
Q3_K_M    ~3.9 GB     estimated
Q4_K_M    ~4.92 GB    measured
Q5_K_M    ~5.7 GB     estimated
Q6_K      ~6.6 GB     estimated
Q8_0      ~8.54 GB    measured
FP16      ~16 GB      estimated
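An estimate from bits-per-weight is a single multiplication: parameters times bits per weight, divided by 8 to get bytes. The sketch below is a rough illustration under two assumptions that are not stated in this guide: a parameter count of about 8.03 billion for Llama 3.1 8B, and decimal gigabytes (1 GB = 10^9 bytes). It uses only the two bits-per-weight figures the guide quotes (16 for FP16 and an effective ~4.9 for Q4_K_M) and ignores file metadata.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    size_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return size_bytes / 1e9


# Assumption: Llama 3.1 8B has roughly 8.03 billion parameters.
N_PARAMS = 8.03e9

# Bits per weight quoted in this guide: 16 for FP16, ~4.9 effective for Q4_K_M.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q4_K_M": 4.9}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
```

Under those assumptions the Q4_K_M estimate lands close to the measured 4.92 GB file, which suggests the simple formula is good enough for planning.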

The jump from Q4_K_M to Q8_0 adds about 70% to the size (4.92 GB to 8.54 GB) for a quality gain most people will not notice in chat. That is why Q4_K_M is the default almost everywhere.
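The size comparisons in this guide can be checked against the table. A minimal sketch, using only the sizes listed above (the helper name `size_ratio` is made up for this example):

```python
# On-disk sizes in GB for Llama 3.1 8B, copied from the table above.
SIZES_GB = {"Q4_K_M": 4.92, "Q8_0": 8.54, "FP16": 16.0}


def size_ratio(quant: str, baseline: str) -> float:
    """How many times larger `quant` is than `baseline`."""
    return SIZES_GB[quant] / SIZES_GB[baseline]


print(f"Q8_0 vs Q4_K_M: {size_ratio('Q8_0', 'Q4_K_M'):.2f}x")   # 1.74x
print(f"Q4_K_M vs FP16: {size_ratio('Q4_K_M', 'FP16'):.2f}x")   # 0.31x
```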

Which quant should you pick?

Q4_K_M, the default

Best balance of size and quality. Start here for any model that fits.

Q5_K_M or Q6_K, a small step up

A little more quality if you have the memory and want it. Diminishing returns.

Q8_0, near-lossless

About 1.7x the size of Q4_K_M. Worth it only when memory is not the constraint.

Q3_K_M or Q2_K, last resort

Use only to squeeze a model that otherwise will not fit. Quality drops noticeably, especially at Q2.
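The order of preference above can be written down as a small chooser. This is one possible reading of the guidance, not a definitive rule: the function name `recommend` is made up for this example, and the `headroom_gb` default is a placeholder for whatever memory the runtime and context need beyond the file itself, which this guide does not quantify.

```python
from typing import List, Optional, Tuple

# On-disk sizes in GB for Llama 3.1 8B, copied from the worked example.
SIZES_GB = {
    "Q2_K": 3.4, "Q3_K_M": 3.9, "Q4_K_M": 4.92,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.54,
}

FALLBACK_ORDER = ["Q4_K_M", "Q3_K_M", "Q2_K"]  # default first, then last resorts
STEP_UPS = ["Q5_K_M", "Q6_K", "Q8_0"]          # optional, only with memory to spare


def recommend(budget_gb: float,
              headroom_gb: float = 1.5) -> Tuple[Optional[str], List[str]]:
    """Return (pick, optional step-ups) for a memory budget in GB.

    headroom_gb is a hypothetical allowance for everything besides the
    weights; the 1.5 GB default is a placeholder, not a measured value.
    """
    usable = budget_gb - headroom_gb
    pick = next((q for q in FALLBACK_ORDER if SIZES_GB[q] <= usable), None)
    if pick != "Q4_K_M":
        # Either a last-resort quant or nothing fits: no step-ups to offer.
        return pick, []
    return pick, [q for q in STEP_UPS if SIZES_GB[q] <= usable]


print(recommend(8.0))   # ('Q4_K_M', ['Q5_K_M'])
print(recommend(6.0))   # ('Q3_K_M', [])
```

The chooser always returns Q4_K_M when it fits and lists larger quants separately, so stepping up stays a deliberate choice rather than the automatic result.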

How this changes what you can run

Quant choice is the single biggest lever on whether a model fits your hardware. The same model at Q4_K_M might fit a 16 GB machine where its FP16 version needs more than three times the memory. To see what fits yours, start from your memory budget or your exact device. For the full memory math, see the methodology.

FAQ

What does Q4_K_M mean?

Q means quantized. The 4 is the rough number of bits used per weight (down from 16 in the original model). K means a k-quant, a smarter scheme that spends more bits on the tensors (groups of weights) that matter most. M is the medium mix of that scheme. So Q4_K_M is a 4-bit k-quant, medium mix: the community default for local use.

Is Q4_K_M good enough, or should I use Q8_0?

Q4_K_M keeps almost all of the model's quality at roughly a third of the FP16 size, so it is the right default. Q8_0 is near-lossless but about 1.7x the size of Q4_K_M, worth it only if you have memory to spare. Below Q4 (Q3, Q2) quality drops off faster and is a last resort for models that otherwise will not fit.

Does a lower quant make the model faster?

Somewhat. Local inference is mostly limited by memory bandwidth, so a smaller quant moves fewer bytes per token and can run a little faster, on top of fitting in less memory. The main reason to drop to Q4 is fit, not speed.
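The bandwidth argument can be turned into a rough ceiling: if generating each token requires reading every weight once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below assumes exactly that simplification, and the 100 GB/s bandwidth is a hypothetical figure chosen for illustration, not a measurement of any device. Real throughput is lower and depends on the hardware and runtime.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed, assuming each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical memory bandwidth for illustration only.
BANDWIDTH_GB_S = 100.0

# Measured Llama 3.1 8B sizes from the worked example.
for name, size_gb in {"Q4_K_M": 4.92, "Q8_0": 8.54}.items():
    ceiling = max_tokens_per_second(BANDWIDTH_GB_S, size_gb)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

This is a ceiling, not a prediction: it shows why a smaller quant has room to run faster, while the gain you actually see is usually smaller than the ratio suggests.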

What is FP16 and do I need it locally?

FP16 is the near-original 16-bit precision. It is roughly 3.3x the size of Q4_K_M (16 bits versus an effective ~4.9 per weight) for almost no quality gain you would notice in chat, so locally it is rarely worth the memory. It matters more for fine-tuning than for running.
