Beyond text · 3 types

Best local image, video & audio models

Text is not the only thing you can run locally. Pick a type to see its models ranked by the GPU memory a run uses, along with the tools and hardware each one needs. Validated 2026-08-03.

Image generation

Stable Diffusion, FLUX, Qwen-Image and more.

6 models from ~3.7 GB

Video generation

Wan, LTX-Video, HunyuanVideo, CogVideoX and more.

11 models from ~6 GB

Speech & audio

Whisper, Kokoro, Orpheus, MusicGen: speech and music.

11 models from ~0.85 GB

Looking for text models? Browse all LLMs →