
Best local speech & audio models

Local speech-to-text, text-to-speech and music models, ranked by memory. The speech models are light enough to run on almost anything; music generation wants a GPU.

Models: 11
Lightest: ~0.85 GB
Heaviest: ~20 GB

Speech to text

Lightest first, by peak VRAM

Whisper small
Speech to text

~0.85 GB

fp16 (whisper.cpp)

Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

Tools: whisper.cpp, faster-whisper, MacWhisper

Whisper large-v3-turbo
Speech to text

~1.5 GB

int8

Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

Tools: whisper.cpp, faster-whisper, MacWhisper

Whisper large-v3
Speech to text

~2.5 GB

int8

Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

Tools: whisper.cpp, faster-whisper, MacWhisper

Text to speech

Lightest first, by peak VRAM

Kokoro-82M
Text to speech

~1 GB

fp32

Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

Tools: kokoro (Python), ONNX Runtime, Kokoro-FastAPI

Orpheus 3B
Text to speech

~4 GB

Q4_K_M GGUF

Runs on: Mac, NVIDIA, AMD, CPU laptop

Tools: llama.cpp, LM Studio, vLLM

Bark
Text to speech (and sound effects)

~5 GB

fp32

Runs on: Mac, NVIDIA, AMD, CPU laptop

Tools: HF Transformers, suno-ai/bark, BetterTransformer

Dia 1.6B
Text to speech (dialogue)

~10 GB

fp16

Runs on: NVIDIA

Tools: PyTorch (CUDA), nari-labs/dia

Music generation

Lightest first, by peak VRAM

MusicGen small
Music generation

~3 GB

fp32

Runs on: Mac, NVIDIA, AMD

Tools: AudioCraft, HF Transformers

MusicGen medium
Music generation

~14 GB

fp32

Runs on: NVIDIA, AMD

Tools: AudioCraft, HF Transformers

Stable Audio Open 1.0
Music and audio generation

~15 GB

fp32

Runs on: NVIDIA, AMD

Tools: stable-audio-tools, diffusers, ComfyUI

MusicGen large
Music generation

~20 GB

fp32

Runs on: NVIDIA

Tools: AudioCraft, HF Transformers

Peak VRAM is the memory a run consumes, the same basis the site uses everywhere; see the methodology. To check a model against your exact device, open its compatibility page.

FAQ

What is the most memory-efficient local speech & audio model?

Whisper small uses the least: about 0.85 GB at fp16 (whisper.cpp). Lower memory means it runs on more hardware.

How much GPU memory do I need for local speech & audio?

It ranges from about 0.85 GB to 20 GB of peak VRAM across the models here. The figure is the memory a run consumes, not the size of card you must buy, so match it to your usable VRAM with a gigabyte or two of margin.
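The matching rule above can be sketched in a few lines of Python. This is only an illustration: the `Model` class, the `CATALOG` list and the `models_that_fit` helper are hypothetical names, the figures are the approximate peak-VRAM values listed on this page, and the 1.5 GB default margin is an assumption picked from the "gigabyte or two" range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    category: str
    peak_vram_gb: float  # approximate peak VRAM as listed in this catalog


# Figures copied from the catalog above; all are approximate ("~") values.
CATALOG = [
    Model("Whisper small", "Speech to text", 0.85),
    Model("Whisper large-v3-turbo", "Speech to text", 1.5),
    Model("Whisper large-v3", "Speech to text", 2.5),
    Model("Kokoro-82M", "Text to speech", 1.0),
    Model("Orpheus 3B", "Text to speech", 4.0),
    Model("Bark", "Text to speech", 5.0),
    Model("Dia 1.6B", "Text to speech", 10.0),
    Model("MusicGen small", "Music generation", 3.0),
    Model("MusicGen medium", "Music generation", 14.0),
    Model("Stable Audio Open 1.0", "Music generation", 15.0),
    Model("MusicGen large", "Music generation", 20.0),
]


def models_that_fit(usable_vram_gb, margin_gb=1.5, catalog=CATALOG):
    """Return models whose peak VRAM fits in usable VRAM minus a safety margin.

    margin_gb defaults to 1.5, an assumed midpoint of "a gigabyte or two".
    Results are ordered lightest first, like the catalog.
    """
    budget = usable_vram_gb - margin_gb
    lightest_first = sorted(catalog, key=lambda m: m.peak_vram_gb)
    return [m for m in lightest_first if m.peak_vram_gb <= budget]


if __name__ == "__main__":
    # Example: a device with 8 GB of usable VRAM.
    for model in models_that_fit(8.0):
        print(f"{model.name}: ~{model.peak_vram_gb} GB ({model.category})")
```

With 8 GB usable and the assumed 1.5 GB margin, the budget is 6.5 GB, so everything up to Bark (~5 GB) passes and Dia 1.6B (~10 GB) does not. This checks memory only; the "Runs on" line of each model still decides whether the platform is supported.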

Is the memory figure the download size or the run size?

The run size: the peak VRAM actually consumed during generation, which is the number that decides whether a model fits. Diffusion models can also offload parts to system RAM to run in less VRAM, at the cost of speed. Every figure here is sourced.
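A rough back-of-the-envelope comparison shows why the run size can differ from the size of the weights. The sketch below estimates weights-only memory from parameter count and precision. The parameter counts are assumptions read off the model names (82M for Kokoro-82M, 1.6B for Dia 1.6B), the bytes-per-parameter values are the standard widths of each format, and `weights_gb` is a hypothetical helper, not part of any tool listed here.

```python
# Standard storage width of each numeric format, in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(num_params, precision):
    """Estimate weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# (name, assumed parameter count, precision, peak VRAM listed on this page)
EXAMPLES = [
    ("Kokoro-82M", 82e6, "fp32", 1.0),
    ("Dia 1.6B", 1.6e9, "fp16", 10.0),
]

if __name__ == "__main__":
    for name, params, precision, listed_peak_gb in EXAMPLES:
        estimate = weights_gb(params, precision)
        print(f"{name}: weights ~{estimate:.2f} GB, listed peak ~{listed_peak_gb} GB")
```

Under these assumptions the weights alone come to roughly 0.33 GB for Kokoro-82M and 3.2 GB for Dia 1.6B, well below the listed peaks of about 1 GB and 10 GB. The remainder is runtime memory such as activations and buffers, which is why the peak figure, not the weights or download size, is the one to match against your device.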
