
Mixture of experts, explained

Mixture-of-experts models look like a free lunch: the speed of a small model with the knowledge of a large one. The catch is memory, and a lot of calculators get it wrong.

The short version

A MoE model splits most of its weights into many experts and uses only a few of them per token. That makes it about as fast as a dense model the size of its active part, but it needs memory for its full total size, because every expert has to be resident.

Two numbers, not one

A dense model has one parameter count that drives both speed and memory. A MoE has two: total parameters and active parameters. They pull in different directions.

Active parameters: how much compute runs per token. Drives speed.

Total parameters: every expert, all resident. Drives memory.

A worked example: Qwen3 30B-A3B

Qwen3 30B-A3B has about 30.5B total parameters but activates only ~3.3B per token. So it generates at roughly the speed of a 3.3B model, while needing the memory of a 30.5B one: about 18.6 GB at Q4_K_M.

Total: 30.5B (sets memory)

Active: ~3.3B (sets speed)

Q4_K_M: ~18.6 GB (has to fit)

The mistake is to size it by the active 3.3B and expect it to fit on a small machine. It will not. You need room for all 30.5B.
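The arithmetic behind that mistake can be sketched in a few lines of Python. This is a rough estimate only: the bits-per-weight figure is derived from the 18.6 GB and 30.5B numbers quoted above rather than taken from any official spec, 1 GB is assumed to be 1e9 bytes, and the context cache and runtime overhead are ignored.

```python
# Figures quoted in this guide for Qwen3 30B-A3B.
TOTAL_PARAMS = 30.5e9
ACTIVE_PARAMS = 3.3e9
Q4_K_M_SIZE_GB = 18.6

# Effective bits per weight implied by those figures.
# Derived here for illustration; not an official constant.
BITS_PER_WEIGHT = Q4_K_M_SIZE_GB * 1e9 * 8 / TOTAL_PARAMS


def weights_gb(params: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Approximate size of the weights in GB (assuming 1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


sized_by_total = weights_gb(TOTAL_PARAMS)    # what actually has to fit
sized_by_active = weights_gb(ACTIVE_PARAMS)  # the mistaken estimate

print(f"implied bits per weight: {BITS_PER_WEIGHT:.2f}")
print(f"sized by total:  {sized_by_total:.1f} GB")
print(f"sized by active: {sized_by_active:.1f} GB")
```

Sizing by the active part comes out at roughly a ninth of the real footprint, which is why a calculator that uses it will say the model fits when it does not.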

MoE models, by total size

Tracked mixture-of-experts models and the gap between what they cost in speed and in memory:

Model total / active

So when should you run a MoE?

When you have the memory for the total size and want more capability without slowing down. If memory is the constraint, compare by on-disk size, not by the active part: a MoE's total is what has to fit. Check any model against your hardware, starting from your memory budget, and see the methodology for the exact MoE memory rule.
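As a minimal sketch of that check, assuming a hypothetical `fits` helper and a made-up 2 GB headroom for context and runtime overhead (neither comes from the methodology), the comparison is against the full on-disk size:

```python
def fits(model_size_gb: float, budget_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the whole model plus headroom fits in the memory budget.

    headroom_gb is an illustrative allowance for context and runtime
    overhead, not a measured value.
    """
    return model_size_gb + headroom_gb <= budget_gb


# Qwen3 30B-A3B at Q4_K_M, using the size quoted in this guide.
total_size_gb = 18.6

# The mistaken figure: the same file scaled down to the active 3.3B of 30.5B.
active_only_gb = total_size_gb * 3.3 / 30.5

for budget in (16.0, 24.0):
    print(
        f"{budget:.0f} GB budget: "
        f"by total -> {fits(total_size_gb, budget)}, "
        f"by active (wrong) -> {fits(active_only_gb, budget)}"
    )
```

On a 16 GB budget the active-only figure passes the check while the real size fails it, which is the false positive to avoid.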

FAQ

Does a mixture-of-experts model use less memory?

No. A MoE model only computes with a few of its experts per token, so it is fast, but every expert has to be loaded in memory in case it is needed. Memory tracks the total parameter count, not the active count. A 30B MoE needs roughly the memory of a 30B dense model at the same quantization.

Why is a MoE model fast then?

Because only a small slice of the weights does work on each token. A 30B model with about 3B active parameters does roughly the compute of a 3B model per token, so it generates quickly, while still having the knowledge of a much larger model spread across its experts.
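To put a rough number on that, the sketch below uses a common rule of thumb of about 2 floating-point operations per active parameter per token. That rule is an assumption here, not something this guide measures, and it ignores attention cost over long contexts. The parameter counts are the Qwen3 30B-A3B figures quoted earlier.

```python
def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2.0 * active_params


moe_flops = flops_per_token(3.3e9)     # only the active part does work
dense_flops = flops_per_token(30.5e9)  # a dense model of the same total size

ratio = dense_flops / moe_flops
print(f"a dense 30.5B model does about {ratio:.1f}x the compute per token")
```

Real generation speed also depends on memory bandwidth and the runtime, so treat the ratio as an upper-level intuition rather than a benchmark.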

Should I run a MoE or a dense model?

If you have the memory for the total size, a MoE gives you more capability per unit of speed. If memory is tight, a dense model of the same on-disk size is usually the safer fit, because the MoE's total size is what has to fit, not its fast active part.

What does a tag like 30B-A3B mean?

The tag lists total parameters, then active parameters. 30B-A3B means about 30 billion total parameters with about 3 billion active per token. The first number drives memory, the second drives speed.
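A small parser makes the convention concrete. This is a sketch under the assumption that tags follow the `<total>B-A<active>B` pattern described above; `MoeTag` and `parse_moe_tag` are hypothetical names, and model names that use a different scheme will simply return `None`.

```python
import re
from typing import NamedTuple, Optional


class MoeTag(NamedTuple):
    total_b: float   # total parameters, in billions (drives memory)
    active_b: float  # active parameters per token, in billions (drives speed)


# Matches tags such as "30B-A3B" or "30.5B-A3.3B" anywhere in a model name.
_TAG = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", re.IGNORECASE)


def parse_moe_tag(name: str) -> Optional[MoeTag]:
    """Return the total/active sizes from a MoE tag, or None if absent."""
    match = _TAG.search(name)
    if match is None:
        return None
    return MoeTag(float(match.group(1)), float(match.group(2)))


tag = parse_moe_tag("Qwen3 30B-A3B")
print(tag)
```

The tag only carries rounded numbers, so use it for a first estimate and the published parameter counts for actual sizing.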
