Best local speech & audio models
Local speech-to-text, text-to-speech and music models, ranked by memory. The speech models are light enough to run on almost anything; music generation wants a GPU.
- Models: 11
- Lightest: ~0.85 GB
- Heaviest: ~20 GB
Speech to text
- Whisper small · Speech to text · ~0.85 GB · fp16 (whisper.cpp)
  Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android
  Tools: whisper.cpp, faster-whisper, MacWhisper
- Whisper large-v3-turbo · Speech to text · ~1.5 GB · int8
  Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android
  Tools: whisper.cpp, faster-whisper, MacWhisper
- Whisper large-v3 · Speech to text · ~2.5 GB · int8
  Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android
  Tools: whisper.cpp, faster-whisper, MacWhisper
Text to speech
- Kokoro-82M · Text to speech · ~1 GB · fp32
  Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android
  Tools: kokoro (Python), ONNX Runtime, Kokoro-FastAPI
- Orpheus 3B · Text to speech · ~4 GB · Q4_K_M GGUF
  Runs on: Mac, NVIDIA, AMD, CPU laptop
  Tools: llama.cpp, LM Studio, vLLM
- Bark · Text to speech (and sound effects) · ~5 GB · fp32
  Runs on: Mac, NVIDIA, AMD, CPU laptop
  Tools: HF Transformers, suno-ai/bark, BetterTransformer
- Dia 1.6B · Text to speech (dialogue) · ~10 GB · fp16
  Runs on: NVIDIA
  Tools: PyTorch (CUDA), nari-labs/dia
Music generation
- MusicGen small · Music generation · ~3 GB · fp32
  Runs on: Mac, NVIDIA, AMD
  Tools: AudioCraft, HF Transformers
- MusicGen medium · Music generation · ~14 GB · fp32
  Runs on: NVIDIA, AMD
  Tools: AudioCraft, HF Transformers
- Stable Audio Open 1.0 · Music and audio generation · ~15 GB · fp32
  Runs on: NVIDIA, AMD
  Tools: stable-audio-tools, diffusers, ComfyUI
- MusicGen large · Music generation · ~20 GB · fp32
  Runs on: NVIDIA
  Tools: AudioCraft, HF Transformers
Peak VRAM is the memory a run consumes, the same basis the site uses everywhere; see the methodology. To check a model against your exact device, open its compatibility page.
FAQ
What is the most memory-efficient local speech & audio model?
Whisper small uses the least: about 0.85 GB at fp16 (whisper.cpp). Lower memory means it runs on more hardware.
How much GPU memory do I need for local speech & audio?
It ranges from about 0.85 GB to 20 GB of peak VRAM across the models here. The figure is the memory a run consumes, not the size of card you must buy, so match it to your usable VRAM with a gigabyte or two of margin.
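The margin rule above can be sketched in a few lines of Python. This is only an illustration: the figures are copied from the catalog on this page, while the function name `models_that_fit` and the default margin of 1.5 GB are assumptions chosen for the example, not part of the site's methodology. It also ignores the "Runs on" column, so a model that fits by memory may still lack support on a given platform.

```python
# Peak-VRAM figures in GB, copied from the catalog on this page.
PEAK_VRAM_GB = {
    "Whisper small": 0.85,
    "Whisper large-v3-turbo": 1.5,
    "Whisper large-v3": 2.5,
    "Kokoro-82M": 1.0,
    "Orpheus 3B": 4.0,
    "Bark": 5.0,
    "Dia 1.6B": 10.0,
    "MusicGen small": 3.0,
    "MusicGen medium": 14.0,
    "Stable Audio Open 1.0": 15.0,
    "MusicGen large": 20.0,
}


def models_that_fit(usable_vram_gb, margin_gb=1.5):
    """Return (name, peak_gb) pairs that fit, lightest first.

    margin_gb is a hypothetical default standing in for the
    "gigabyte or two" of headroom suggested above.
    """
    budget = usable_vram_gb - margin_gb
    fits = [(name, gb) for name, gb in PEAK_VRAM_GB.items() if gb <= budget]
    return sorted(fits, key=lambda item: item[1])


if __name__ == "__main__":
    # Example: a card with 8 GB of usable VRAM.
    for name, gb in models_that_fit(8.0):
        print(f"{name}: ~{gb} GB")
```

Pass your usable VRAM, not the card's nominal size, since the desktop and other applications already hold part of it.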
Is the memory figure the download size or the run size?
The run size: peak VRAM actually consumed during generation, which is the number that decides whether the model fits. Diffusion models can also offload parts to system RAM to run in less VRAM, at the cost of speed. Every figure here is sourced.