localmodel.run

Best local speech & audio models

Local speech-to-text, text-to-speech and music models, ranked by peak VRAM within each category. Most of the speech models are light enough to run on almost anything; music generation wants a GPU.

Models: 11 · Lightest: ~0.85 GB · Heaviest: ~20 GB

Speech to text

Sorted lightest first, by peak VRAM
  • Whisper small
    Speech to text
    ~0.85 GB
    fp16 (whisper.cpp)
    Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

    Tools: whisper.cpp, faster-whisper, MacWhisper

  • Whisper large-v3-turbo
    Speech to text
    ~1.5 GB
    int8
    Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

    Tools: whisper.cpp, faster-whisper, MacWhisper

  • Whisper large-v3
    Speech to text
    ~2.5 GB
    int8
    Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

    Tools: whisper.cpp, faster-whisper, MacWhisper

Text to speech

Sorted lightest first, by peak VRAM
  • Kokoro-82M
    Text to speech
    ~1 GB
    fp32
    Runs on: Mac, NVIDIA, AMD, CPU laptop, iPhone, Android

    Tools: kokoro (Python), ONNX Runtime, Kokoro-FastAPI

  • Orpheus 3B
    Text to speech
    ~4 GB
    Q4_K_M GGUF
    Runs on: Mac, NVIDIA, AMD, CPU laptop

    Tools: llama.cpp, LM Studio, vLLM

  • Bark
    Text to speech (and sound effects)
    ~5 GB
    fp32
    Runs on: Mac, NVIDIA, AMD, CPU laptop

    Tools: HF Transformers, suno-ai/bark, BetterTransformer

  • Dia 1.6B
    Text to speech (dialogue)
    ~10 GB
    fp16
    Runs on: NVIDIA

    Tools: PyTorch (CUDA), nari-labs/dia

Music generation

Sorted lightest first, by peak VRAM
  • MusicGen small
    Music generation
    ~3 GB
    fp32
    Runs on: Mac, NVIDIA, AMD

    Tools: AudioCraft, HF Transformers

  • MusicGen medium
    Music generation
    ~14 GB
    fp32
    Runs on: NVIDIA, AMD

    Tools: AudioCraft, HF Transformers

  • Stable Audio Open 1.0
    Music and audio generation
    ~15 GB
    fp32
    Runs on: NVIDIA, AMD

    Tools: stable-audio-tools, diffusers, ComfyUI

  • MusicGen large
    Music generation
    ~20 GB
    fp32
    Runs on: NVIDIA

    Tools: AudioCraft, HF Transformers

Peak VRAM is the memory a run consumes, the same basis the site uses everywhere; see the methodology. To check a model against your exact device, open its compatibility page.

FAQ

What is the most memory-efficient local speech & audio model?

Whisper small uses the least: about 0.85 GB at fp16 (whisper.cpp). Lower memory means it runs on more hardware.

How much GPU memory do I need for local speech & audio?

It ranges from about 0.85 GB to 20 GB of peak VRAM across the models here. The figure is the memory a run consumes, not the size of card you must buy, so match it to your usable VRAM with a gigabyte or two of margin.
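As a rough illustration of that matching step, the sketch below filters the catalog by a VRAM budget. The peak figures are copied from the listings above; the function name `models_that_fit` and the default margin of 1.5 GB are assumptions made for this example (the text only says "a gigabyte or two"), not something the site provides.

```python
# Peak VRAM figures in GB, copied from the catalog listings above.
PEAK_VRAM_GB = {
    "Whisper small": 0.85,
    "Whisper large-v3-turbo": 1.5,
    "Whisper large-v3": 2.5,
    "Kokoro-82M": 1.0,
    "Orpheus 3B": 4.0,
    "Bark": 5.0,
    "Dia 1.6B": 10.0,
    "MusicGen small": 3.0,
    "MusicGen medium": 14.0,
    "Stable Audio Open 1.0": 15.0,
    "MusicGen large": 20.0,
}


def models_that_fit(usable_vram_gb, margin_gb=1.5):
    """Return the models whose peak VRAM fits after reserving a margin.

    margin_gb is a hypothetical default for this sketch; pick your own
    value in the "gigabyte or two" range.
    """
    budget = usable_vram_gb - margin_gb
    fitting = [(gb, name) for name, gb in PEAK_VRAM_GB.items() if gb <= budget]
    # Sort by peak VRAM so the result is lightest first, like the catalog.
    return [name for gb, name in sorted(fitting)]


if __name__ == "__main__":
    for name in models_that_fit(8.0):
        print(name)
```

With 8 GB usable and the default margin, the budget is 6.5 GB, so everything up to Bark (~5 GB) passes and Dia 1.6B (~10 GB) does not.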

Is the memory figure the download size or the run size?

The run size: peak VRAM actually consumed during generation, which is the number that decides whether a model fits. Diffusion models can also offload parts to system RAM to run on less VRAM, at the cost of speed. Every figure here is sourced.
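To see why the run size can differ from the size of the weights, here is a back-of-the-envelope sketch. It assumes the parameter counts implied by the model names (82 million for Kokoro-82M, 1.6 billion for Dia 1.6B) and counts only the bytes needed to hold the weights at the listed precision. The helper `weights_gb` is hypothetical, and the result is a lower bound rather than a measurement.

```python
# Bytes per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts are assumptions read off the model names.
examples = [
    ("Kokoro-82M", 82e6, "fp32", 1.0),   # listed peak: ~1 GB
    ("Dia 1.6B", 1.6e9, "fp16", 10.0),   # listed peak: ~10 GB
]

if __name__ == "__main__":
    for name, params, precision, listed_peak in examples:
        estimate = weights_gb(params, precision)
        print(f"{name}: weights ~{estimate:.2f} GB, listed peak ~{listed_peak} GB")
```

In both cases the weights-only estimate comes out well below the listed peak. The remainder is what a run adds on top of the weights, such as activations, intermediate buffers and runtime overhead, which is why the catalog reports peak usage rather than file size.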

Sources