5 platforms · runtimes guide

Best tools to run LLMs locally

The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.

macOS runtime guide

Beginner pick

LM Studio

Polished GUI, ships MLX on Apple Silicon, one-click model downloads.

Power user

mlx-lm

Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.

$ pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit

Runtime	What it is	Status
MLX / mlx-lm	Apple-native inference, usually fastest on M-series	stable
LM Studio	GUI app, MLX + GGUF backends	stable
Ollama	CLI/server, simple ollama run UX	stable
llama.cpp (Metal)	Reference cross-platform engine	stable
Rapid-MLX	MLX inference server with an OpenAI-compatible API, prompt cache and tool calling; brew or pip install	stable

Gotcha: vLLM is NOT a Mac tool, it is a CUDA/Linux serving engine. Unified memory is not a fixed VRAM slice; ~70% is usable for weights.

Realistic ceiling M-series with 16GB runs up to ~8B comfortably; 64GB+ runs 70B at Q4.

Windows runtime guide

Beginner pick

LM Studio

Best GUI on Windows, auto-detects CUDA/Vulkan backends.

Power user

Ollama (CUDA)

Scriptable server; CUDA path is fastest on NVIDIA.

$ ollama run llama3.1:8b

Runtime	What it is	Status
LM Studio	GUI, CUDA/Vulkan/ROCm	stable
Ollama	CLI/server	stable
llama.cpp	Reference engine	stable

Gotcha: AMD GPUs run via Vulkan/ROCm at roughly half CUDA throughput. NVIDIA is the smooth path on Windows.

Realistic ceiling 12GB VRAM runs up to ~14B at Q4; 24GB runs 32B at Q4.

Linux runtime guide

Beginner pick

Ollama

One-line install, simple single-user UX.

Power user

vLLM

Highest throughput for serving many requests; PagedAttention + continuous batching.

$ pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct

Runtime	What it is	Status
vLLM	OpenAI-compatible serving engine	stable
Ollama	CLI/server	stable
llama.cpp	Reference engine	stable

Gotcha: vLLM shines for multi-user serving/throughput. For a single local chat, Ollama or llama.cpp is simpler and lighter.

Realistic ceiling Scales with VRAM; multi-GPU for 70B+ at higher precision.

iOS runtime guide

Beginner pick

Apple Foundation Models

Built into iOS 26, ~3B on-device model, zero download, fully private.

Power user

PocketPal AI

Run any GGUF from HuggingFace fully offline.

Runtime	What it is	Status
Apple Foundation Models	On-device ~3B model API (iOS 26+)	stable
PocketPal AI	llama.cpp wrapper app, any GGUF	stable
fullmoon	Open-source MLX Swift app (Apple Silicon)	stable
Enclave AI	Free offline assistant with on-device voice	stable
Private LLM	Paid app, 140+ models, Neural Engine accelerated	stable
Layla	Cross-platform offline app (GGUF + ExecuTorch)	stable
Google AI Edge Gallery	Google's on-device Gemma app	stable
LLM Farm	Open-source on-device runner	stable

Gotcha: Phones realistically run 1B-4B class models. Anything larger thermally throttles or OOMs.

Realistic ceiling 1B-4B at Q4 on 8GB iPhones; iPad Pro M4 (16GB) handles more.

Android runtime guide

Beginner pick

PocketPal AI

Polished app, download GGUF and run offline.

Power user

MLC LLM / LiteRT-LM

GPU/NPU acceleration paths for supported chips.

Runtime	What it is	Status
PocketPal AI	llama.cpp/llama.rn app	stable
MLC LLM	Compiled on-device inference	stable
Google AI Edge / LiteRT-LM	On-device LLM runtime for Gemma	stable
ChatterUI	Native llama.cpp GGUF frontend (APK)	stable
SmolChat	Open-source GGUF app (Play Store + GitHub)	stable
Layla	Cross-platform offline app	stable
Google AI Edge Gallery	Google's on-device Gemma app	stable

Gotcha: NPU acceleration is limited and chip-specific; most apps run on CPU. Expect 1B-4B class.

Realistic ceiling 1B-4B at Q4 on 8-16GB phones.

How fast is each tool?

How fast a model runs comes down to four things: your hardware, the model size, the quant, and how many requests hit it at once. The numbers below are decode speed, the tokens per second you actually watch stream out, for one request at a time. Two things worth knowing before you read them: decode and prefill are not the same metric, and one-user speed tells you nothing about how a tool holds up under a crowd.

Apple Silicon · 4-bit, single user (tok/s)

On a Mac, MLX is the one to beat. mlx-lm, and LM Studio which runs MLX under the hood, pull 20 to 55% more tokens per second than llama.cpp on dense models under 30B. Push past 30B and that lead fades, because now you are bandwidth-bound and the runtime barely matters. Ollama saw this coming and added an MLX backend in v0.19, which claws back most of the gap on 32GB+ Macs.

Hardware	Model	mlx-lm / LM Studio	Ollama (llama.cpp)	Source
M3 Max 64GB	Llama 3.1 8B	71	58	contracollective
M4 Max 128GB	Qwen3 8B	79.9	76.9	arXiv
M1 Max 64GB	Qwen2.5 7B	63.7	40.8	groundy.com
M3 Ultra 512GB	Gemma 3 27B	33	24	Google Cloud

One catch: MLX can bog down on prefill at long context (8K+ tokens), and llama.cpp sometimes chews through a big prompt faster even with slower decode (famstack.dev). And Rapid-MLX? It markets a 4.2x lead over Ollama, but that is one cherry-picked run against an old Ollama build. Nobody has independently reproduced anything close, so we are not putting a multiplier next to its name.

NVIDIA CUDA (tok/s)

If it is just you, Ollama and llama.cpp keep right up with vLLM. vLLM earns its keep when a crowd shows up: continuous batching lets it serve hundreds of requests at once without falling over. The batched column is peak total tokens per second at the concurrency shown in brackets.

Hardware	Model	Tool	1 user	Batched	Source
RTX 4090	Llama 3.1 8B Q4_K_M	llama.cpp	~113-128	n/a	hardware-corner
RTX 4090	Llama 3.1 8B Q4_K_M	Ollama	~95	n/a	databasemart
A100 40GB	Llama 3.1 8B FP16	Ollama	~45	41 @256	Red Hat
A100 40GB	Llama 3.1 8B FP16	vLLM	~38	793 @256	Red Hat
2x A100	Llama 70B FP16	llama.cpp	~17	196 @32	llama.cpp
2x A100	Llama 70B FP16	vLLM	~35	649 @32	llama.cpp

Pile 256 users onto an A100 and vLLM puts out roughly 19x the tokens Ollama manages. With one user, they are neck and neck. So: serving a lot of people, reach for vLLM (or SGLang); running a tool just for yourself, Ollama and llama.cpp are simpler and plenty fast.

FAQ

Is vLLM good for Mac?

No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.

Ollama or LM Studio?

LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.

What is the fastest way to run a model on Apple Silicon?

mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.

Best tools to run LLMs locally

Apple Silicon · 4-bit, single user (tok/s)

NVIDIA CUDA (tok/s)

Sources