Skip to content
localmodel.run

5 platforms · runtimes guide

Best tools to run LLMs locally

The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.

macOS runtime guide
Beginner pick
LM Studio

Polished GUI, ships MLX on Apple Silicon, one-click model downloads.

Power user
mlx-lm

Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.

$ pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit
Runtime What it is Status
MLX / mlx-lm Apple-native inference, usually fastest on M-series stable
LM Studio GUI app, MLX + GGUF backends stable
Ollama CLI/server, simple ollama run UX stable
llama.cpp (Metal) Reference cross-platform engine stable
Rapid-MLX MLX inference server with an OpenAI-compatible API, prompt cache and tool calling; brew or pip install stable
Gotcha: vLLM is NOT a Mac tool, it is a CUDA/Linux serving engine. Unified memory is not a fixed VRAM slice; ~70% is usable for weights.
Windows runtime guide
Beginner pick
LM Studio

Best GUI on Windows, auto-detects CUDA/Vulkan backends.

Power user
Ollama (CUDA)

Scriptable server; CUDA path is fastest on NVIDIA.

$ ollama run llama3.1:8b
Runtime What it is Status
LM Studio GUI, CUDA/Vulkan/ROCm stable
Ollama CLI/server stable
llama.cpp Reference engine stable
Gotcha: AMD GPUs run via Vulkan/ROCm at roughly half CUDA throughput. NVIDIA is the smooth path on Windows.
Linux runtime guide
Beginner pick
Ollama

One-line install, simple single-user UX.

Power user
vLLM

Highest throughput for serving many requests; PagedAttention + continuous batching.

$ pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct
Runtime What it is Status
vLLM OpenAI-compatible serving engine stable
Ollama CLI/server stable
llama.cpp Reference engine stable
Gotcha: vLLM shines for multi-user serving/throughput. For a single local chat, Ollama or llama.cpp is simpler and lighter.
iOS runtime guide
Beginner pick
Apple Foundation Models

Built into iOS 26, ~3B on-device model, zero download, fully private.

Power user
PocketPal AI

Run any GGUF from HuggingFace fully offline.

Runtime What it is Status
Apple Foundation Models On-device ~3B model API (iOS 26+) stable
PocketPal AI llama.cpp wrapper app stable
LLM Farm Open-source on-device runner stable
Gotcha: Phones realistically run 1B-4B class models. Anything larger thermally throttles or OOMs.
Android runtime guide
Beginner pick
PocketPal AI

Polished app, download GGUF and run offline.

Power user
MLC LLM / LiteRT-LM

GPU/NPU acceleration paths for supported chips.

Runtime What it is Status
PocketPal AI llama.cpp/llama.rn app stable
MLC LLM Compiled on-device inference stable
Google AI Edge / LiteRT-LM On-device LLM runtime for Gemma stable
Gotcha: NPU acceleration is limited and chip-specific; most apps run on CPU. Expect 1B-4B class.

How fast is each tool?

How fast a model runs comes down to four things: your hardware, the model size, the quant, and how many requests hit it at once. The numbers below are decode speed, the tokens per second you actually watch stream out, for one request at a time. Two things worth knowing before you read them: decode and prefill are not the same metric, and one-user speed tells you nothing about how a tool holds up under a crowd.

Apple Silicon · 4-bit, single user (tok/s)

On a Mac, MLX is the one to beat. mlx-lm, and LM Studio which runs MLX under the hood, pull 20 to 55% more tokens per second than llama.cpp on dense models under 30B. Push past 30B and that lead fades, because now you are bandwidth-bound and the runtime barely matters. Ollama saw this coming and added an MLX backend in v0.19, which claws back most of the gap on 32GB+ Macs.

Hardware Model mlx-lm / LM Studio Ollama (llama.cpp) Source
M3 Max 64GB Llama 3.1 8B 71 58 contracollective
M4 Max 128GB Qwen3 8B 79.9 76.9 arXiv
M1 Max 64GB Qwen2.5 7B 63.7 40.8 groundy.com
M3 Ultra 512GB Gemma 3 27B 33 24 Google Cloud

One catch: MLX can bog down on prefill at long context (8K+ tokens), and llama.cpp sometimes chews through a big prompt faster even with slower decode (famstack.dev). And Rapid-MLX? It markets a 4.2x lead over Ollama, but that is one cherry-picked run against an old Ollama build. Nobody has independently reproduced anything close, so we are not putting a multiplier next to its name.

NVIDIA CUDA (tok/s)

If it is just you, Ollama and llama.cpp keep right up with vLLM. vLLM earns its keep when a crowd shows up: continuous batching lets it serve hundreds of requests at once without falling over. The batched column is peak total tokens per second at the concurrency shown in brackets.

Hardware Model Tool 1 user Batched Source
RTX 4090 Llama 3.1 8B Q4_K_M llama.cpp ~113-128 n/a hardware-corner
RTX 4090 Llama 3.1 8B Q4_K_M Ollama ~95 n/a databasemart
A100 40GB Llama 3.1 8B FP16 Ollama ~45 41 @256 Red Hat
A100 40GB Llama 3.1 8B FP16 vLLM ~38 793 @256 Red Hat
2x A100 Llama 70B FP16 llama.cpp ~17 196 @32 llama.cpp
2x A100 Llama 70B FP16 vLLM ~35 649 @32 llama.cpp

Pile 256 users onto an A100 and vLLM puts out roughly 19x the tokens Ollama manages. With one user, they are neck and neck. So: serving a lot of people, reach for vLLM (or SGLang); running a tool just for yourself, Ollama and llama.cpp are simpler and plenty fast.

FAQ

Is vLLM good for Mac?

No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.

Ollama or LM Studio?

LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.

What is the fastest way to run a model on Apple Silicon?

mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.

Sources