
Every Piper Voice, Ranked: The Practical Guide to Piper's voices.json

Piper ships hundreds of voices across dozens of languages, and the official catalog is a flat voices.json file with no opinion on which ones are actually good. We listened to every English voice in there, plus the strongest non-English picks, and wrote down what would have saved us a week of A/B-ing in Quick TTS. If you're building anything that ships Piper, this is the shortlist.

What Piper actually is, and why the catalog is confusing

Piper is a neural TTS toolkit that came out of the Rhasspy project, a fast, offline voice assistant originally built for the Raspberry Pi. The architecture is VITS (a 2021 design that combines a variational autoencoder with adversarial training), trained at relatively small parameter counts so that a single utterance renders faster than real time even on a Pi 4. The browser path uses vits-web, an ONNX/WASM build of those models.

The catalog is confusing because it's a community catalog. Anyone with a recording corpus and a few days of GPU time can train and contribute a voice; the Rhasspy team merges the ONNX export and a metadata blob into voices.json. There's no curation step, no quality bar, no "this voice is approved for production" flag. The result is a catalog where en_US-libritts_r-medium (excellent, audiobook-grade) sits next to en_US-arctic-medium (a 2003 corpus that sounds it). Making sense of it takes listening. We did the listening.

How quality is graded in voices.json

Each voice carries a quality tier (x_low, low, medium, or high) that is a useful but imperfect signal. The tiers correspond to model size and audio sample rate, as summarized in the table below.

The tier is necessary but not sufficient. en_US-libritts_r-medium sounds better than en_US-l2arctic-medium despite being the same tier, because the underlying corpus matters more than the model size. Listen first, trust the tier label second.
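To compare voices the way the paragraph above does (same tier, different corpus), it helps to pull the corpus name and the tier out of a voice ID. The following is a minimal sketch, assuming IDs follow the `<locale>-<dataset>-<tier>` shape of the IDs quoted in this post; `parseVoiceId` is a hypothetical helper, not part of Piper or Quick TTS.

```javascript
// Tier names as listed in this article, smallest to largest.
const KNOWN_TIERS = ["x_low", "low", "medium", "high"];

// Split an ID such as "en_US-libritts_r-medium" into its parts.
// Assumption: the first "-" segment is the locale, the last is the tier,
// and everything in between is the dataset (corpus) name.
function parseVoiceId(voiceId) {
  const parts = voiceId.split("-");
  if (parts.length < 3) {
    throw new Error(`Unexpected voice ID shape: ${voiceId}`);
  }
  const locale = parts[0];
  const tier = parts[parts.length - 1];
  const dataset = parts.slice(1, -1).join("-");
  if (!KNOWN_TIERS.includes(tier)) {
    throw new Error(`Unknown quality tier "${tier}" in ${voiceId}`);
  }
  const [language, region] = locale.split("_");
  return { locale, language, region, dataset, tier };
}

const libritts = parseVoiceId("en_US-libritts_r-medium");
const l2arctic = parseVoiceId("en_US-l2arctic-medium");

// Same tier, different corpus: the tier alone cannot rank these two.
console.log(libritts.tier === l2arctic.tier, libritts.dataset, l2arctic.dataset);
```

Grouping a candidate list by `dataset` before listening makes it easier to notice when two voices differ only in tier and when they differ in the underlying recordings.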

Table: Piper quality tiers, what the file size buys you, and when each tier is the right pick for browser use.

| Tier | Sample rate | Approx. model size | When to pick it |
|---|---|---|---|
| x_low | 16 kHz | ~10 MB | Microcontroller targets, or languages where x_low is the only option (e.g. Vietnamese vivos). |
| low | 16 kHz | ~20 MB | Embedded voice assistants, short utterances. Avoid for long-form reading. |
| medium | 22.05 kHz | ~30–60 MB | Browser default. Best balance of quality, download size, and CPU cost. |
| high | 22.05 kHz | ~100 MB | When bandwidth isn't a concern and the underlying corpus is strong (Lessac, Thorsten, Cori). |
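The tier and size columns can be turned into a first-pass filter before any listening happens. The sketch below assumes each catalog entry carries a `quality` string and a `files` map whose values have a `size_bytes` field; treat that shape as an assumption to check against the real voices.json. The sample entries use illustrative sizes taken from the table's approximations, not measured file sizes, and `modelSizeMB` and `voicesWithinBudget` are hypothetical helpers.

```javascript
// Hypothetical excerpt in the shape we assume voices.json uses.
// Sizes are illustrative placeholders based on the tier table.
const sampleCatalog = {
  "en_US-libritts_r-medium": {
    quality: "medium",
    files: {
      "en/en_US/libritts_r/medium/en_US-libritts_r-medium.onnx": { size_bytes: 60e6 },
      "en/en_US/libritts_r/medium/en_US-libritts_r-medium.onnx.json": { size_bytes: 5e3 },
    },
  },
  "en_US-lessac-high": {
    quality: "high",
    files: {
      "en/en_US/lessac/high/en_US-lessac-high.onnx": { size_bytes: 100e6 },
    },
  },
  "vi_VN-vivos-x_low": {
    quality: "x_low",
    files: {
      "vi/vi_VN/vivos/x_low/vi_VN-vivos-x_low.onnx": { size_bytes: 10e6 },
    },
  },
};

// Sum only the .onnx model files; config and model-card files are ignored.
function modelSizeMB(entry) {
  const bytes = Object.entries(entry.files)
    .filter(([path]) => path.endsWith(".onnx"))
    .reduce((sum, [, file]) => sum + file.size_bytes, 0);
  return bytes / 1e6;
}

// Return voice IDs whose tier is acceptable and whose model fits the budget.
function voicesWithinBudget(catalog, allowedTiers, maxMB) {
  return Object.entries(catalog)
    .filter(([, entry]) => allowedTiers.includes(entry.quality))
    .filter(([, entry]) => modelSizeMB(entry) <= maxMB)
    .map(([id]) => id)
    .sort();
}

console.log(voicesWithinBudget(sampleCatalog, ["medium", "high"], 70));
```

A filter like this only narrows the candidate list. As the section says, the corpus matters more than the tier, so the survivors still need a listening pass.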

What to listen for when ranking a Piper voice

"Sounds good" is not a useful evaluation metric. After enough listening hours we settled on a small checklist that catches almost every voice's weaknesses inside one paragraph of test text:

The five English voices we'd actually ship

If you're picking from Piper's English catalog, here are the five worth your attention. Voice IDs match the catalog in huggingface.co/rhasspy/piper-voices:

Quick TTS' config.js ships en_US-libritts_r-medium, en_US-joe-medium, and en_GB-vctk-medium as the English Piper allowlist. en_US-joe-medium covers the same niche as hfc_male with a slightly cleaner corpus; we picked it for download-size reasons more than quality.

Voices we'd skip

Some Piper voices are dated, some have prosody bugs that didn't get cleaned up before release, and some are research artifacts that shouldn't really be in a user-facing catalog. Honest call-outs:

Non-English Piper voices worth knowing about

Piper's non-English coverage is uneven, but the standouts are very strong. The voices below are pulled directly from the public catalog — when you pick them in Quick TTS, the model file downloads on first use and caches afterwards.

The licensing minefield

Piper voices are published under a mix of licenses, and some of them do not permit commercial use. The catalog includes CC-BY 4.0, CC0, MIT, Apache 2.0, and several voices restricted to non-commercial research. The license is listed in each voice's MODEL_CARD on Hugging Face; there is no central "is_commercial_ok" flag in voices.json itself.

Practical implication if you're shipping Piper in a product: don't enumerate voices.json at runtime and offer everything. Build an explicit allowlist and audit the licenses once. Quick TTS' config.js keeps a per-locale allowlist (AI_TTS_PIPER_VOICES_BY_LANG) that has been hand-checked against each model card; voices we couldn't verify as commercial-permissive don't appear, even when they sound great. If you're embedding Piper for a non-commercial or research project the bar is lower, but the same audit step is worth doing once so you know what you've shipped.
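The allowlist-first approach described above might look like the sketch below. The real shape of AI_TTS_PIPER_VOICES_BY_LANG in config.js is not shown in this post, so the object layout here is an assumption, and `selectableVoices` is a hypothetical helper. The key design point is that the picker starts from the audited list and only checks the catalog for existence, never the other way around.

```javascript
// Assumed shape: language code -> array of audited voice IDs.
const exampleAllowlist = {
  en: ["en_US-libritts_r-medium", "en_US-joe-medium", "en_GB-vctk-medium"],
  de: ["de_DE-thorsten-medium"],
};

// Stand-in for a parsed voices.json; only the keys matter for this sketch.
const exampleCatalog = {
  "en_US-libritts_r-medium": {},
  "en_US-joe-medium": {},
  "en_US-arctic-medium": {}, // present in the catalog, but never audited
  "de_DE-thorsten-medium": {},
};

// Offer only voices that are on the allowlist AND still present in the catalog.
// Voices that exist in the catalog but were never audited are not offered.
function selectableVoices(catalog, allowlist, lang) {
  const audited = Object.prototype.hasOwnProperty.call(allowlist, lang)
    ? allowlist[lang]
    : [];
  return audited.filter((id) =>
    Object.prototype.hasOwnProperty.call(catalog, id)
  );
}

console.log(selectableVoices(exampleCatalog, exampleAllowlist, "en"));
console.log(selectableVoices(exampleCatalog, exampleAllowlist, "ko"));
```

In this example en_GB-vctk-medium is dropped because the stand-in catalog lacks it, and en_US-arctic-medium is never offered because nobody audited it. A language without an allowlist entry yields an empty list instead of falling back to the full catalog.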

Why we don't ship every voice

The other reason Quick TTS' Piper allowlist is short: download budget. A medium-quality Piper voice can run to roughly 60 MB. If we surfaced every English voice in the catalog (35+ at last count), the user's first interaction with the voice picker would either trigger about 2 GB of background downloads (35 × 60 MB ≈ 2.1 GB) or, more realistically, become one of those frustrating UIs where every voice option has its own loading spinner and you can't tell which one will actually work without clicking and waiting.

Three voices per language is the sweet spot we landed on after a few iterations: enough variety that a user who doesn't like the default has a real alternative, few enough that the catalog feels curated rather than dumped. The cost of being more opinionated is that some genuinely good voices (looking at you, en_US-bryce-medium, en_US-kristin-medium) don't make the cut. If you have a strong feeling about one of them, the FAQ has a contact link and we read every request.
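The budget argument in this section is simple arithmetic, sketched here with the post's own round numbers (about 60 MB per medium voice, 35 English voices, three voices shipped per language). `totalDownloadGB` is a hypothetical helper, and decimal gigabytes (1 GB = 1000 MB) are assumed.

```javascript
// Worst-case download if every listed voice were fetched, in decimal GB.
function totalDownloadGB(voiceCount, mbPerVoice) {
  return (voiceCount * mbPerVoice) / 1000;
}

const fullEnglishCatalog = totalDownloadGB(35, 60); // every English voice
const shippedEnglish = totalDownloadGB(3, 60);      // three-voice allowlist

console.log(fullEnglishCatalog); // 2.1
console.log(shippedEnglish);     // 0.18
```

Even in the worst case, where a user auditions all three shipped voices, the allowlist stays under a fifth of a gigabyte for English.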

Quick TTS' actual allowlist, with reasoning

For full transparency, here's what config.js ships per locale and the one-line reason each voice is on the list. The allowlist comes from AI_TTS_PIPER_VOICES_BY_LANG:

Table: Quick TTS' per-locale Piper allowlist (excerpt). Each voice's license has been verified against its MODEL_CARD on huggingface.co/rhasspy/piper-voices.

| Locale | Voice IDs shipped | Why |
|---|---|---|
| en | en_US-libritts_r-medium, en_US-joe-medium, en_GB-vctk-medium | Audiobook default + male alternative + British multi-speaker. Three voices covers the long tail without bloating the picker. |
| de | de_DE-thorsten-medium | Best single-speaker open German neural voice in any catalog. We considered shipping the high tier but kept the medium for download-size reasons. |
| es | es_ES-davefx-medium, es_MX-claude-high | Iberian + Mexican coverage. Argentine Spanish (daniela-high) is arguably better but adds a third voice; we left it out and may revisit. |
| fr | fr_FR-siwis-medium | Most consistent French voice. fr_FR-tom-medium is warmer but the SIWIS prosody is more reliable for long-form content. |
| it | it_IT-paola-medium | The only shippable Italian voice in the catalog at this tier. Piper's Italian shelf is genuinely thin. |
| nl | nl_NL-mls-medium | Cleanest Netherlands Dutch. We deliberated on adding a Flemish (nl_BE) voice and decided one voice was enough for v1. |
| vi | vi_VN-vais1000-medium | Best Vietnamese option, period. Piper has only three Vietnamese voices total; this is the only one above x_low. |

The full table covers 14 locales. For Japanese and Korean, config.js intentionally has no Piper entry — the Piper community has not produced viable voices for either language, and Quick TTS routes those locales straight to Kokoro (Japanese has 5 Kokoro voices) or Web Speech (Korean has 8 Edge Online voices). Honest caveat: Korean is the language Quick TTS covers least well, and we don't try to hide it.
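The routing described above, with Piper where the allowlist has an entry and Kokoro or Web Speech where it does not, can be sketched as a small lookup. The voice IDs come from the excerpt table; the object name, the engine labels ("piper", "kokoro", "webspeech"), and the default for locales outside the excerpt are assumptions for illustration, not the contents of config.js.

```javascript
// Allowlist excerpt from the table above (7 of the 14 locales).
const PIPER_VOICES_EXCERPT = {
  en: ["en_US-libritts_r-medium", "en_US-joe-medium", "en_GB-vctk-medium"],
  de: ["de_DE-thorsten-medium"],
  es: ["es_ES-davefx-medium", "es_MX-claude-high"],
  fr: ["fr_FR-siwis-medium"],
  it: ["it_IT-paola-medium"],
  nl: ["nl_NL-mls-medium"],
  vi: ["vi_VN-vais1000-medium"],
};

// Locales the post says are routed away from Piper on purpose.
const NON_PIPER_ENGINE = { ja: "kokoro", ko: "webspeech" };

const hasKey = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Resolve a locale such as "en-US" or "de_DE" to an engine and voice list.
function resolveEngine(locale) {
  const lang = locale.toLowerCase().split(/[-_]/)[0];
  if (hasKey(PIPER_VOICES_EXCERPT, lang)) {
    return { engine: "piper", voices: PIPER_VOICES_EXCERPT[lang] };
  }
  if (hasKey(NON_PIPER_ENGINE, lang)) {
    return { engine: NON_PIPER_ENGINE[lang], voices: [] };
  }
  // Assumption: anything not covered by this excerpt falls back to Web Speech.
  return { engine: "webspeech", voices: [] };
}

console.log(resolveEngine("en-US").engine); // piper
console.log(resolveEngine("ja").engine);    // kokoro
console.log(resolveEngine("ko-KR").engine); // webspeech
```

Keeping the exceptions in their own table makes the deliberate gaps visible in code review, so a missing Piper entry reads as a decision and not as an oversight.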

Try them yourself

Open Quick TTS, switch the engine selector to Piper, and audition the shipped voices on whatever paragraph you're trying to listen to. The first time you select a voice it downloads the model (typically 5–10 seconds on a fast connection); every subsequent use is instant from browser cache.

If you want the broader picture on how Piper compares to the other browser TTS options, our deep-dive on Web Speech API vs Piper vs Kokoro has the side-by-side. The best-free-TTS-voices post is the higher-level "which voices should I bother with at all" version of this same exercise. The FAQ covers integration questions; the project home is at Quick TTS.