Beyond text · 3 types
Best local image, video & audio models
Text is not the only thing you can run locally. Pick a type to see its models ranked by the GPU memory a run uses, along with the tools and hardware each one needs. Validated 2026-06-15.
- Image generation
  Stable Diffusion, FLUX, Qwen-Image and more.
  6 models from ~3.7 GB
- Video generation
  Wan, LTX-Video, HunyuanVideo, CogVideoX and more.
  11 models from ~6 GB
- Speech & audio
  Whisper, Kokoro, Orpheus, MusicGen: speech and music.
  11 models from ~0.85 GB
Looking for text models? Browse all LLMs →