5 platforms · runtimes guide
Best tools to run LLMs locally
The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.
Polished GUI, ships MLX on Apple Silicon, one-click model downloads.
Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.
pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit | Runtime | What it is | Status |
|---|---|---|
| MLX / mlx-lm | Apple-native inference, usually fastest on M-series | stable |
| LM Studio | GUI app, MLX + GGUF backends | stable |
| Ollama | CLI/server, simple ollama run UX | stable |
| llama.cpp (Metal) | Reference cross-platform engine | stable |
| Rapid-MLX | MLX inference server with an OpenAI-compatible API, prompt cache and tool calling; brew or pip install | stable |
Best GUI on Windows, auto-detects CUDA/Vulkan backends.
Scriptable server; CUDA path is fastest on NVIDIA.
ollama run llama3.1:8b | Runtime | What it is | Status |
|---|---|---|
| LM Studio | GUI, CUDA/Vulkan/ROCm | stable |
| Ollama | CLI/server | stable |
| llama.cpp | Reference engine | stable |
One-line install, simple single-user UX.
Highest throughput for serving many requests; PagedAttention + continuous batching.
pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct | Runtime | What it is | Status |
|---|---|---|
| vLLM | OpenAI-compatible serving engine | stable |
| Ollama | CLI/server | stable |
| llama.cpp | Reference engine | stable |
Built into iOS 26, ~3B on-device model, zero download, fully private.
Run any GGUF from HuggingFace fully offline.
| Runtime | What it is | Status |
|---|---|---|
| Apple Foundation Models | On-device ~3B model API (iOS 26+) | stable |
| PocketPal AI | llama.cpp wrapper app | stable |
| LLM Farm | Open-source on-device runner | stable |
Polished app, download GGUF and run offline.
GPU/NPU acceleration paths for supported chips.
| Runtime | What it is | Status |
|---|---|---|
| PocketPal AI | llama.cpp/llama.rn app | stable |
| MLC LLM | Compiled on-device inference | stable |
| Google AI Edge / LiteRT-LM | On-device LLM runtime for Gemma | stable |
How fast is each tool?
How fast a model runs comes down to four things: your hardware, the model size, the quant, and how many requests hit it at once. The numbers below are decode speed, the tokens per second you actually watch stream out, for one request at a time. Two things worth knowing before you read them: decode and prefill are not the same metric, and one-user speed tells you nothing about how a tool holds up under a crowd.
Apple Silicon · 4-bit, single user (tok/s)
On a Mac, MLX is the one to beat. mlx-lm, and LM Studio which runs MLX under the hood, pull 20 to 55% more tokens per second than llama.cpp on dense models under 30B. Push past 30B and that lead fades, because now you are bandwidth-bound and the runtime barely matters. Ollama saw this coming and added an MLX backend in v0.19, which claws back most of the gap on 32GB+ Macs.
| Hardware | Model | mlx-lm / LM Studio | Ollama (llama.cpp) | Source |
|---|---|---|---|---|
| M3 Max 64GB | Llama 3.1 8B | 71 | 58 | contracollective |
| M4 Max 128GB | Qwen3 8B | 79.9 | 76.9 | arXiv |
| M1 Max 64GB | Qwen2.5 7B | 63.7 | 40.8 | groundy.com |
| M3 Ultra 512GB | Gemma 3 27B | 33 | 24 | Google Cloud |
One catch: MLX can bog down on prefill at long context (8K+ tokens), and llama.cpp sometimes chews through a big prompt faster even with slower decode (famstack.dev). And Rapid-MLX? It markets a 4.2x lead over Ollama, but that is one cherry-picked run against an old Ollama build. Nobody has independently reproduced anything close, so we are not putting a multiplier next to its name.
NVIDIA CUDA (tok/s)
If it is just you, Ollama and llama.cpp keep right up with vLLM. vLLM earns its keep when a crowd shows up: continuous batching lets it serve hundreds of requests at once without falling over. The batched column is peak total tokens per second at the concurrency shown in brackets.
| Hardware | Model | Tool | 1 user | Batched | Source |
|---|---|---|---|---|---|
| RTX 4090 | Llama 3.1 8B Q4_K_M | llama.cpp | ~113-128 | n/a | hardware-corner |
| RTX 4090 | Llama 3.1 8B Q4_K_M | Ollama | ~95 | n/a | databasemart |
| A100 40GB | Llama 3.1 8B FP16 | Ollama | ~45 | 41 @256 | Red Hat |
| A100 40GB | Llama 3.1 8B FP16 | vLLM | ~38 | 793 @256 | Red Hat |
| 2x A100 | Llama 70B FP16 | llama.cpp | ~17 | 196 @32 | llama.cpp |
| 2x A100 | Llama 70B FP16 | vLLM | ~35 | 649 @32 | llama.cpp |
Pile 256 users onto an A100 and vLLM puts out roughly 19x the tokens Ollama manages. With one user, they are neck and neck. So: serving a lot of people, reach for vLLM (or SGLang); running a tool just for yourself, Ollama and llama.cpp are simpler and plenty fast.
FAQ
Is vLLM good for Mac?
No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.
Ollama or LM Studio?
LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.
What is the fastest way to run a model on Apple Silicon?
mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.