VRAM, RAM and unified memory
Where a model actually lives decides how big it can be and how fast it runs. There are three kinds of memory it can use, and the same number of gigabytes is not equal across them.
GPU VRAM is fastest but smallest. System RAM is largest but slow for inference. Apple unified memory sits in between: one big pool shared by CPU and GPU. And on every one of them, usable memory is less than the sticker number.
The three kinds of memory
GPU VRAM (discrete graphics card)
Dedicated, very high-bandwidth memory on the card. A model that fits here runs fastest. It is limited in size: consumer cards top out around 24 to 32 GB, and you lose about 1 GB to the driver and display.
System RAM (CPU inference)
Ordinary computer memory. There is usually plenty of it, so you can load large models, but CPU memory bandwidth is far lower, so generation is slow. Leave several gigabytes for the operating system and apps.
Apple unified memory (Apple Silicon)
One pool shared by the CPU and GPU, so there is no separate VRAM. The GPU can use most of it for weights, which is why a high-memory Mac runs models a typical PC GPU cannot. macOS reserves part of the pool.
Worked example: what 16 GB really gives a model
Three machines all advertise 16 GB. Here is what each one can actually hand a model, anchored to real devices:
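- Apple unified memory: Apple M5 (16 GB), roughly 10.6 GB usable (about 66% of the pool)
- GPU VRAM: Nvidia GeForce RTX 4080 (16 GB), about 15 GB usable (about 1 GB goes to the driver and display)
- System RAM (CPU only): 16 GB RAM laptop (CPU/iGPU only), about 12 GB usable (the rest is left for the OS and apps)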
Same 16 GB, different ceilings. The GPU hands over the most and the Mac the least, because macOS keeps about a third of the pool. That is why "best LLM for 16 GB" has more than one answer; see the full breakdown on the 16 GB budget page.
Why usable is always less than total
- Discrete GPU: about 1 GB of VRAM goes to the driver and display.
- Apple Silicon: the Metal working-set limit is roughly 66% of unified memory on machines with less than 64 GB, and about 75% on machines with 64 GB or more.
- CPU-only: leave a few gigabytes for the OS and apps. Since that overhead is roughly fixed, the share you keep grows with total RAM: about 12 GB usable on a 16 GB laptop, more on 32 GB.
The site applies these rules everywhere; the exact figures and sources are in the methodology.
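The three rules above can be written down as a small calculation. This is a minimal sketch, not the site's actual implementation: the function name `usable_memory_gb` and the `kind` labels are hypothetical, and the 4 GB CPU overhead is an assumption chosen to match the "about 12 GB usable on a 16 GB laptop" figure.

```python
def usable_memory_gb(total_gb: float, kind: str) -> float:
    """Estimate the memory a model can use, following the rules of thumb above."""
    if kind == "gpu":
        # Discrete GPU: about 1 GB of VRAM goes to the driver and display.
        return max(0.0, total_gb - 1.0)
    if kind == "apple":
        # Apple Silicon: roughly 66% under 64 GB, about 75% at or above.
        fraction = 0.75 if total_gb >= 64 else 0.66
        return total_gb * fraction
    if kind == "cpu":
        # CPU-only: roughly fixed OS/app overhead. 4 GB is an assumption,
        # picked to reproduce "about 12 GB usable on a 16 GB laptop".
        return max(0.0, total_gb - 4.0)
    raise ValueError(f"unknown memory kind: {kind!r}")


for kind in ("gpu", "apple", "cpu"):
    print(f"{kind:>5}: about {usable_memory_gb(16, kind):.1f} GB usable of 16 GB")
```

Because the CPU overhead is a fixed subtraction rather than a percentage, the usable share grows with total RAM, which matches the note in the CPU-only rule.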
FAQ
Is VRAM or RAM better for running LLMs?
VRAM on a discrete GPU is much faster, because GPUs have far more memory bandwidth than CPU system memory. A model that fits in VRAM runs many times faster than the same model in system RAM on a CPU. The catch is that VRAM is limited and expensive, so system RAM lets you run bigger models, just slowly.
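To see why bandwidth matters so much, a common rule of thumb treats single-user generation as memory-bound: every new token has to read roughly all of the model's weights once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below applies that approximation. The bandwidth figures are hypothetical placeholders, not measurements of any specific device, and real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound, assuming each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical, illustrative bandwidth values; check your own hardware's specs.
examples = {
    "discrete GPU VRAM": 700.0,  # GB/s, assumed
    "system RAM (CPU)": 60.0,    # GB/s, assumed
}
model_size_gb = 8.0  # assumed size of the model weights in memory

for name, bandwidth in examples.items():
    ceiling = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{name}: at most about {ceiling:.1f} tokens/s")
```

With these assumed numbers the ratio between the two ceilings is simply the ratio of the bandwidths, which is the "many times faster" gap described above.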
What is unified memory on a Mac?
Apple Silicon shares one pool of memory between the CPU and GPU, so there is no separate VRAM. The GPU can use most of that pool for model weights, which is why a 32 GB or 64 GB Mac can run models that would need an expensive multi-GPU rig on a PC. macOS reserves part of the pool, so roughly 66% is usable on machines with less than 64 GB and about 75% on machines with 64 GB or more.
Why can't I use all my memory for the model?
Some is always reserved. A discrete GPU loses about 1 GB of VRAM to the driver and display. macOS caps the GPU working set at roughly two thirds of unified memory, or about three quarters on machines with 64 GB or more. A CPU-only machine needs several gigabytes for the operating system and apps. The usable figure, not the sticker number, is what a model has to fit in.
Can a model use VRAM and RAM at the same time?
Yes, by offloading. Runtimes can keep some layers in VRAM and the rest in system RAM, which lets a model run when it does not fully fit in VRAM. It works, but the part in RAM is slow, so throughput drops sharply the more you offload.
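As a rough illustration of how an offload split works out, the sketch below counts how many equally sized layers fit in usable VRAM and sends the rest to system RAM. The function `split_layers` and the example numbers are hypothetical, and the sketch ignores the context cache and other buffers that real runtimes also keep in VRAM, so treat the result as an optimistic estimate.

```python
def split_layers(n_layers: int, layer_size_gb: float, usable_vram_gb: float) -> tuple[int, int]:
    """Return (layers_in_vram, layers_in_ram), assuming equally sized layers."""
    fit = int(usable_vram_gb // layer_size_gb)  # whole layers that fit in VRAM
    in_vram = max(0, min(n_layers, fit))
    return in_vram, n_layers - in_vram


# Hypothetical model: 40 layers of 0.5 GB each (20 GB of weights),
# on a 16 GB card with about 15 GB usable.
in_vram, in_ram = split_layers(n_layers=40, layer_size_gb=0.5, usable_vram_gb=15.0)
print(f"{in_vram} layers in VRAM, {in_ram} layers in system RAM")
```

The more layers land in the second number, the more of each token's work runs at system RAM speed, which is why throughput falls as the offloaded share grows.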