VRAM, RAM and unified memory

Where a model actually lives decides how big it can be and how fast it runs. There are three kinds of memory it can use, and a gigabyte in one is not worth the same as a gigabyte in another.

The short version

GPU VRAM is fastest but smallest. System RAM is largest but slow for inference. Apple unified memory sits in between: one big pool shared by CPU and GPU. And on every one of them, usable memory is less than the sticker number.

Is unified memory the same as VRAM?

Not the same, but close in practice. VRAM is dedicated memory on a discrete GPU. Unified memory, as on Apple Silicon, is one pool the CPU and GPU share, so there is no separate video memory. For running a model the two behave alike, because the GPU reads weights from fast memory either way. The practical difference: unified memory is shared with the whole system, so only about 66 to 75% of it is usable for a model (the higher figure applies at 64 GB and above), whereas a discrete GPU hands over nearly all its VRAM.

The three kinds of memory

GPU VRAM (discrete graphics card)

Dedicated, very high bandwidth memory on the card. A model that fits here runs fastest. Limited in size: consumer cards top out around 24 to 32 GB, and you lose about 1 GB to the driver and display.

System RAM (CPU inference)

Ordinary computer memory. There is usually plenty of it, so you can load large models, but CPU memory bandwidth is far lower, so generation is slow. Leave several gigabytes for the operating system and apps.

Apple unified memory (Apple Silicon)

One pool shared by the CPU and GPU, so there is no separate VRAM. The GPU can use most of it for weights, which is why a high-memory Mac runs models a typical PC GPU cannot. macOS reserves part of the pool.
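To see why bandwidth, and not just capacity, separates the three kinds of memory above, here is a minimal sketch of a common rule of thumb: generating one token reads roughly all of the model's weights once, so memory bandwidth divided by model size gives a rough upper bound on tokens per second. The function name and the bandwidth figures are illustrative assumptions, not measurements of any specific device, and the bound ignores compute time, context reads and batching.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on generation speed for a memory-bound model.

    Assumes each generated token requires reading all weights once,
    so speed <= memory bandwidth / model size.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


# Illustrative round numbers (assumptions), for a hypothetical 8 GB model.
examples = [
    ("fast GPU-class memory", 800.0),
    ("ordinary CPU system RAM", 50.0),
]
for label, bandwidth in examples:
    bound = max_tokens_per_second(bandwidth, 8.0)
    print(f"{label}: at most about {bound:.1f} tokens/s")
```

Real throughput lands below this bound, but the ratio between the two rows is the point: the same model slows down roughly in proportion to the bandwidth it loses.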

Worked example: what 16 GB really gives a model

Three machines all advertise 16 GB. Here is what each one can actually hand a model, anchored to real devices:

Apple unified memory: Apple M5 (16 GB), ~10.5 GB usable
GPU VRAM: Nvidia GeForce RTX 4080 (16 GB), ~15 GB usable
System RAM (CPU only): 16 GB RAM laptop (CPU/iGPU only), ~12 GB usable

Same 16 GB, different ceilings. The GPU hands over the most, the Mac the least, because macOS keeps a third of the pool. That is why "best LLM for 16 GB" has more than one answer; see the full breakdown on the 16 GB budget page.
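As a quick illustration of why one sticker size gives more than one answer, the sketch below checks a model's footprint against the three usable figures from this example. The 11 GB footprint and the function name are hypothetical, chosen only to show a case where the machines disagree.

```python
# Usable memory from the worked example, in GB.
USABLE_GB = {
    "Apple M5 (16GB)": 10.5,
    "Nvidia GeForce RTX 4080 (16GB)": 15.0,
    "16GB RAM Laptop (CPU only)": 12.0,
}


def machines_that_fit(model_gb: float) -> list[str]:
    """Return the machines whose usable memory can hold the model."""
    return [name for name, usable in USABLE_GB.items() if model_gb <= usable]


# Hypothetical footprint: 11 GB for weights plus context.
print(machines_that_fit(11.0))
# ['Nvidia GeForce RTX 4080 (16GB)', '16GB RAM Laptop (CPU only)']
```

The same model fits on two of the three 16 GB machines and not on the third. Fitting says nothing about speed: the CPU-only laptop holds the model but runs it slowly.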

Why usable is always less than total

Discrete GPU: about 1 GB of VRAM goes to the driver and display.
Apple Silicon: the Metal working-set limit is roughly 66% of unified memory under 64 GB, about 75% at or above.
CPU-only: leave a few gigabytes for the OS and apps. Since that overhead is roughly fixed, the share you keep grows with total RAM: about 12 GB usable on a 16 GB laptop (75%), and a larger share on 32 GB.

The site applies these rules everywhere; the exact figures and sources are in the methodology.
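The three rules above can be sketched as one small function. This is an approximation under stated assumptions, not the site's actual implementation: the function name and the `kind` labels are made up for illustration, and the 4 GB operating-system reserve is inferred from the 16 GB laptop example rather than stated as a rule.

```python
GPU_OVERHEAD_GB = 1.0   # driver and display, per the rule above
OS_RESERVE_GB = 4.0     # assumption: inferred from 16 GB -> ~12 GB usable


def usable_gb(total_gb: float, kind: str) -> float:
    """Estimate usable memory for a model.

    kind is one of "vram", "unified" or "ram" (hypothetical labels).
    """
    if kind == "vram":
        return max(total_gb - GPU_OVERHEAD_GB, 0.0)
    if kind == "unified":
        # Roughly 66% under 64 GB, about 75% at or above.
        fraction = 0.75 if total_gb >= 64 else 0.66
        return total_gb * fraction
    if kind == "ram":
        return max(total_gb - OS_RESERVE_GB, 0.0)
    raise ValueError(f"unknown memory kind: {kind!r}")


for kind in ("unified", "vram", "ram"):
    print(kind, round(usable_gb(16, kind), 2))
```

For 16 GB this gives roughly 10.5, 15 and 12 GB, matching the worked example. Because the CPU reserve is fixed while the other two rules scale with size, the ordering can change at larger totals.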

FAQ

Is VRAM or RAM better for running LLMs?

VRAM on a discrete GPU is much faster, because GPUs have far more memory bandwidth than CPU system memory. A model that fits in VRAM runs many times faster than the same model in system RAM on a CPU. The catch is that VRAM is limited and expensive, so RAM lets you run bigger models, just slowly.

What is unified memory on a Mac?

Apple Silicon shares one pool of memory between the CPU and GPU, so there is no separate VRAM. The GPU can use most of that pool for model weights, which is why a 32 GB or 64 GB Mac can run models that would need an expensive multi-GPU rig on a PC. macOS reserves part of the pool, so roughly 66% is usable under 64 GB and about 75% at or above.

Is unified memory the same as VRAM?

Not exactly. VRAM is dedicated memory on a discrete GPU; unified memory is one pool the CPU and GPU share, as on Apple Silicon. For running a model the effect is similar, since the GPU reads weights from fast memory either way. The difference is that unified memory is shared with the whole system, so only part of it (about 66% under 64 GB, 75% at or above) is usable for the model, where a discrete GPU gives you nearly all of its VRAM.

Why can't I use all my memory for the model?

Some is always reserved. A discrete GPU loses about 1 GB of VRAM to the driver and display. macOS caps the GPU working set at roughly two thirds of unified memory under 64 GB, and about three quarters at or above. A CPU-only machine needs several gigabytes for the operating system and apps. The usable figure, not the sticker number, is what a model has to fit in.

Can a model use VRAM and RAM at the same time?

Yes, by offloading. Runtimes can keep some layers in VRAM and the rest in system RAM, which lets a model run when it does not fully fit in VRAM. It works, but the part in RAM is slow, so throughput drops sharply the more you offload.
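A minimal sketch of the offloading split described above, assuming every layer is the same size and ignoring context and scratch buffers, which real runtimes also have to place. The function name and the example numbers (40 layers of 0.5 GB each) are hypothetical.

```python
def split_layers(n_layers: int, layer_gb: float, usable_vram_gb: float) -> tuple[int, int]:
    """Return (layers_in_vram, layers_in_ram) for equal-size layers."""
    if n_layers < 0 or layer_gb <= 0:
        raise ValueError("n_layers must be >= 0 and layer_gb must be > 0")
    # Fill VRAM first; whatever does not fit stays in system RAM.
    fit = int(usable_vram_gb // layer_gb)
    in_vram = max(0, min(n_layers, fit))
    return in_vram, n_layers - in_vram


# Hypothetical 20 GB model: 40 layers of 0.5 GB, on a card with 15 GB usable.
print(split_layers(40, 0.5, 15.0))  # (30, 10)
```

Here 30 layers run from VRAM and 10 from system RAM. Every layer left in RAM is read at the slower bandwidth on each token, which is why throughput falls as the second number grows.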
