Every Piper Voice, Ranked: The Practical Guide to Piper's voices.json
Piper ships hundreds of voices across dozens of languages, and the official catalog is a flat voices.json file with no opinion on which ones are actually good. We listened to every English voice in there, plus the strongest non-English picks, and wrote down what would have saved us a week of A/B-ing in Quick TTS. If you're building anything that ships Piper, this is the shortlist.
What Piper actually is, and why the catalog is confusing
Piper is a neural TTS toolkit that came out of the Rhasspy project, an offline voice-assistant toolkit aimed at devices like the Raspberry Pi. The architecture is VITS (a conditional variational autoencoder combined with adversarial training, published in 2021), trained at relatively small parameter counts so a single utterance renders faster than real time even on a Pi 4. The browser path uses vits-web, an ONNX/WASM build of those models.
The catalog is confusing because it's a community catalog. Anyone with a recording corpus and a few days of GPU time can train and contribute a voice; the Rhasspy team merges the ONNX export and a metadata blob into voices.json. There's no curation step, no quality bar, no "this voice is approved for production" flag. The result is a catalog where en_US-libritts_r-medium (excellent, audiobook-grade) sits next to en_US-arctic-medium (a 2003 corpus that sounds it). Making sense of it takes listening. We did the listening.
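To make the "flat file, no opinion" point concrete, here is a minimal sketch of grouping catalog entries by language family. The sample entries are hand-written, and the field names (`language.family`, `quality`) are an assumption about the voices.json layout that you should check against the real file before relying on them.

```python
import json
from collections import defaultdict

# Hand-written sample in the ASSUMED shape of voices.json: a flat object
# keyed by voice ID, with no field that says which voices are good.
SAMPLE_CATALOG = json.loads("""
{
  "en_US-libritts_r-medium": {"language": {"code": "en_US", "family": "en"}, "quality": "medium"},
  "en_US-arctic-medium": {"language": {"code": "en_US", "family": "en"}, "quality": "medium"},
  "de_DE-thorsten-high": {"language": {"code": "de_DE", "family": "de"}, "quality": "high"}
}
""")


def group_by_family(catalog):
    """Group (voice_id, quality) pairs by language family, sorted by voice ID."""
    grouped = defaultdict(list)
    for voice_id, meta in catalog.items():
        grouped[meta["language"]["family"]].append((voice_id, meta["quality"]))
    return {family: sorted(voices) for family, voices in grouped.items()}


if __name__ == "__main__":
    for family, voices in sorted(group_by_family(SAMPLE_CATALOG).items()):
        print(family, voices)
```

Nothing in the grouped output separates the excellent voice from the dated one in the same tier, which is the gap the listening fills.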
How quality is graded in voices.json
Each voice carries a quality tier — x_low, low, medium, or high — that is a useful but imperfect signal. The tiers correspond to model and audio configuration:
- x_low: 16 kHz output, smallest model. Aimed at the Raspberry Pi Zero and similarly constrained devices. Audibly compressed and "telephony-grade". Skip in the browser unless bandwidth is genuinely the bottleneck.
- low: 16 kHz output, slightly larger model. Better than x_low, still distinctly synthetic. Acceptable for short utterances in voice assistants; not for long-form reading.
- medium: 22.05 kHz output, the default tier most languages have. ~30–60 MB ONNX file. This is the sweet spot for browser use: quality is comparable to a 2021 commercial neural TTS, and the download takes only a few seconds on a fast connection.
- high: 22.05 kHz output, larger model trained on more data per speaker. Roughly 100 MB. Noticeably better than medium when the underlying corpus is good (LibriTTS, Lessac), only marginally better when it isn't.
The tier is necessary but not sufficient. en_US-libritts_r-medium sounds better than en_US-l2arctic-medium despite being the same tier, because the underlying corpus matters more than the model size. Listen first, trust the tier label second.
| Tier | Sample rate | Approx. model size | When to pick it |
|---|---|---|---|
| x_low | 16 kHz | ~10 MB | Very constrained devices such as the Raspberry Pi Zero, or voices where x_low is the only tier available (e.g. Vietnamese vivos). |
| low | 16 kHz | ~20 MB | Embedded voice assistants, short utterances. Avoid for long-form reading. |
| medium | 22.05 kHz | ~30–60 MB | Browser default. Best balance of quality, download size, and CPU cost. |
| high | 22.05 kHz | ~100 MB | When bandwidth isn't a concern and the underlying corpus is strong (Lessac, Thorsten, Cori). |
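The tier is also the last component of every voice ID, so the table above can be turned into a lookup. This is a minimal sketch: the `locale-name-quality` split is inferred from the IDs quoted in this article, and the sizes are the approximate figures from the table, not measured values.

```python
from typing import NamedTuple

# Approximate figures copied from the tier table; sizes are rough, not measured.
TIERS = {
    "x_low": {"sample_rate_hz": 16000, "approx_mb": 10},
    "low": {"sample_rate_hz": 16000, "approx_mb": 20},
    "medium": {"sample_rate_hz": 22050, "approx_mb": 60},  # upper end of ~30-60 MB
    "high": {"sample_rate_hz": 22050, "approx_mb": 100},
}


class VoiceId(NamedTuple):
    locale: str
    name: str
    quality: str


def parse_voice_id(voice_id: str) -> VoiceId:
    """Split an ID such as 'en_US-libritts_r-medium' into its three parts.

    Assumes hyphens only separate the parts; underscores may appear inside
    the locale, the name, and the 'x_low' tier.
    """
    locale, name, quality = voice_id.split("-")
    if quality not in TIERS:
        raise ValueError(f"unknown quality tier: {quality!r}")
    return VoiceId(locale, name, quality)


if __name__ == "__main__":
    voice = parse_voice_id("vi_VN-vivos-x_low")
    print(voice, TIERS[voice.quality])
```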
What to listen for when ranking a Piper voice
"Sounds good" is not a useful evaluation metric. After enough listening hours we settled on a small checklist that catches almost every voice's weaknesses inside one paragraph of test text:
- Sentence-final intonation. Does the voice fall on a period and rise on a question mark? Many community-trained Piper voices have the prosody flipped or compressed — every sentence ends on the same pitch regardless of punctuation. Test sentence: "It's getting late. Are we leaving yet?"
- Comma timing. A natural reader takes a beat at each comma. Bad VITS training merges clauses without pause. Test sentence: "She turned, paused for a moment, then walked out without saying anything."
- Number reading. "404" should be "four-oh-four" or "four hundred and four", not "four zero four". Most Piper voices handle digits fine but a few — including some of the Arctic variants — read them digit-by-digit.
- Sustained vowels. The "VITS shimmer" — a slight tremolo on long vowels — is the single most consistent giveaway that audio is synthetic. Higher-tier voices reduce it; the high-tier Lessac and Thorsten voices have it almost entirely controlled.
- Em-dash handling. Em-dashes should produce a thought-pause shorter than a period and longer than a comma. Most voices treat them as commas, which is acceptable. A few treat them as hard stops, which sounds wrong. Test sentence: "He hesitated — knowing the consequences — then signed."
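The checklist above is easy to keep as data, so every voice is auditioned on the same paragraph. A minimal sketch follows; the sentences for number reading and sustained vowels are our own placeholders (the checklist gives none), and the pass/fail judgement still comes from a human listener.

```python
# (check name, test sentence) pairs. Sentences marked "placeholder" are
# hypothetical; the rest are the test sentences quoted in the checklist.
CHECKS = [
    ("sentence-final intonation", "It's getting late. Are we leaving yet?"),
    ("comma timing", "She turned, paused for a moment, then walked out without saying anything."),
    ("number reading", "The server returned a 404."),  # placeholder
    ("sustained vowels", "Slowly, the snow froze over the wide blue sea."),  # placeholder
    ("em-dash handling", "He hesitated \u2014 knowing the consequences \u2014 then signed."),
]


def audition_text(checks=CHECKS) -> str:
    """Join the test sentences into one paragraph to paste into a TTS tool."""
    return " ".join(sentence for _, sentence in checks)


def failed_checks(results: dict) -> list:
    """Return the checks a listener did not mark as passed, in checklist order."""
    return [name for name, _ in CHECKS if not results.get(name, False)]


if __name__ == "__main__":
    print(audition_text())
    # Example score sheet filled in by hand after listening to one voice.
    print(failed_checks({"comma timing": True, "number reading": True}))
```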
The five English voices we'd actually ship
If you're picking from Piper's English catalog, here are the five worth your attention. Voice IDs match the catalog in huggingface.co/rhasspy/piper-voices:
- en_US-libritts_r-medium: the audiobook default. If you have to pick one English voice, pick this. It's trained on the LibriTTS-R corpus (a sound-quality-restored version of LibriTTS, which is itself derived from LibriVox audiobooks), so the prosody is paragraph-aware in a way that conversational corpora aren't. Sentence-final intonation lands naturally; commas get the right length of pause. Quick TTS uses it as the default Piper voice and we've never regretted it.
- en_US-amy-medium: the conversational alternative. Amy is what you reach for when LibriTTS sounds too "performed" (think notification readbacks, short alerts, or chat-like content). Slightly less polished on long passages, but warmer on single sentences.
- en_GB-vctk-medium: multi-speaker British. VCTK is a 109-speaker corpus from the University of Edinburgh. The Piper port exposes the speakers as speaker IDs within the model (you pass a speaker index at inference time). Practical use: pick a single VCTK speaker that fits your tone, ship that one. The corpus quality is high and the Received Pronunciation speakers are the closest Piper gets to a BBC reader.
- en_US-lessac-high: the highest-quality English voice in the catalog. Lessac is an audiobook-style corpus with a single speaker (the narrator who recorded the original Lessac dataset for academic TTS work). The "high" model is roughly twice the file size of medium and you can hear the difference: sustained vowels are smoother, the "VITS shimmer" is reduced. Pick this one when bandwidth isn't a concern and you want the best Piper can do.
- en_US-hfc_male-medium: the male alternative. Most of the strongest single-speaker Piper corpora are female (Amy, Lessac, the LibriTTS speakers we tend to surface). hfc_male is the cleanest male voice in the catalog without going the multi-speaker VCTK route. Slightly more "regional American" than the others, in a way that works for conversational content.
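For multi-speaker models such as VCTK, "pick one speaker and ship it" comes down to resolving a speaker to an index once and pinning it. The sketch below assumes the voice's JSON config carries `num_speakers` and a `speaker_id_map`; those field names and the sample speaker names are assumptions for illustration, not values copied from a real model.

```python
# Hypothetical excerpt of a multi-speaker voice config (field names assumed).
SAMPLE_VOICE_CONFIG = {
    "num_speakers": 3,
    "speaker_id_map": {"speaker_a": 0, "speaker_b": 1, "speaker_c": 2},
}


def resolve_speaker(config: dict, speaker) -> int:
    """Turn a speaker name or index into a validated integer speaker index."""
    num_speakers = config.get("num_speakers", 1)
    id_map = config.get("speaker_id_map", {})
    if isinstance(speaker, str):
        if speaker not in id_map:
            raise KeyError(f"unknown speaker name: {speaker!r}")
        index = id_map[speaker]
    else:
        index = int(speaker)
    if not 0 <= index < num_speakers:
        raise ValueError(f"speaker index {index} out of range 0..{num_speakers - 1}")
    return index


if __name__ == "__main__":
    # Pin the chosen speaker once, then pass the index at inference time.
    print(resolve_speaker(SAMPLE_VOICE_CONFIG, "speaker_b"))
```

Validating the index up front keeps a typo in the configuration from silently selecting a different speaker.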
Quick TTS' config.js ships en_US-libritts_r-medium, en_US-joe-medium, and en_GB-vctk-medium as the English Piper allowlist. en_US-joe-medium covers the same niche as hfc_male with a slightly cleaner corpus; we picked it for download-size reasons more than quality.
Voices we'd skip
Some Piper voices are dated, some have prosody bugs that didn't get cleaned up before release, and some are research artifacts that shouldn't really be in a user-facing catalog. Honest call-outs:
- en_US-arctic-medium: dated. The Arctic corpus is from 2003. Even with a 2024-era VITS model on top, the source audio's mid-2000s recording quality bleeds through. Sounds like an early-Windows TTS in a way the others don't.
- en_US-norman-medium: consistent prosody issues. Norman has a recurring tendency to end declarative sentences on a rising tone, which makes everything sound like a question. We assume the training data had a regional speaker style that didn't generalize cleanly. You'll notice it within the first paragraph of any article.
- en_US-l2arctic-medium: research model, not a polished voice. L2-ARCTIC is a corpus of non-native English speakers used for accent-recognition research. The Piper port faithfully reproduces accents from that corpus, which is interesting academically but not what most users want when they paste an article. Skip unless you're specifically studying L2 prosody.
- en_US-kathleen-low, en_US-ryan-low, en_US-amy-low: the "low" variants exist for embedded contexts. They're not strictly bad, just clearly worse than their medium siblings while being only slightly smaller. In a browser there's no reason to pick them; pay the extra ~30 MB and ship the medium tier.
- en_GB-southern_english_female-low, en_GB-northern_english_male-medium: interesting in concept, rough in execution. Both are accent-specific corpora that haven't quite found enough training data to compete. The British flavor is real but the artifact rate (clicks, breathiness, occasional dropped phonemes) is higher than VCTK or Cori.
Non-English Piper voices worth knowing about
Piper's non-English coverage is uneven, but the standouts are very strong. The voices below are pulled directly from the public catalog — when you pick them in Quick TTS, the model file downloads on first use and caches afterwards.
- de_DE-thorsten-high: German. The Thorsten corpus (recorded by an open-source contributor over several years) is one of the best single-speaker open neural TTS datasets in any language. The high model is genuinely audiobook-grade. de_DE-thorsten-medium is also excellent if bandwidth matters.
- de_DE-thorsten_emotional-medium: German with emotion labels. Same speaker as thorsten, with labelled speaking styles (such as neutral, angry, surprised, and disgusted) exposed as separate speakers within the model. You drive emotion via the speaker index parameter. A genuine novelty in the open-source TTS world; nothing else in voices.json offers anything comparable.
- es_AR-daniela-high: Argentine Spanish. Most Spanish TTS defaults to Castilian (es_ES) or Mexican (es_MX). Daniela is one of very few high-quality neural voices for Argentine Spanish anywhere, open-source or commercial. If you have rioplatense-Spanish content, this is the only practical browser option.
- es_MX-claude-high: Mexican Spanish. Solid neutral Latin American Spanish at the high tier. Quick TTS uses it as the secondary Spanish voice behind es_ES-davefx-medium.
- fr_FR-tom-medium: French. Cleaner than the SIWIS-derived voices for long-form prose. SIWIS is the more academically standard French TTS corpus but Tom's recording is warmer.
- it_IT-paola-medium: Italian. Piper's Italian shelf is shallow (only two voices, one of them x_low quality), and Paola is the only one we'd ship. It's not as polished as the German or English mediums but it's the best offline Italian neural voice that runs in a browser today.
- nl_NL-mls-medium: Dutch. Dutch is unusually well-served in Piper's catalog (ten voices spanning Netherlands and Belgian Flemish), and nl_NL-mls-medium is the most consistent. nl_NL-pim-medium and nl_BE-nathalie-medium are both worth a listen if you want a male or a Flemish alternative.
- vi_VN-vivos-x_low: Vietnamese. Yes, x_low. Vietnamese is one of the languages where the OS TTS coverage is so thin that an x_low neural voice still beats the alternatives. Pair with vi_VN-vais1000-medium for the better-quality option when bandwidth allows.
The licensing minefield
Piper voices are published under a mix of licenses, and some do not permit commercial use. The catalog includes CC-BY 4.0, CC0, MIT, Apache 2.0, and several voices restricted to non-commercial research. The license is in each voice's MODEL_CARD on Hugging Face; there's no central "is_commercial_ok" flag in voices.json itself.
Practical implication if you're shipping Piper in a product: don't enumerate voices.json at runtime and offer everything. Build an explicit allowlist and audit the licenses once. Quick TTS' config.js keeps a per-locale allowlist (AI_TTS_PIPER_VOICES_BY_LANG) that has been hand-checked against each model card; voices we couldn't verify as commercial-permissive don't appear, even when they sound great. If you're embedding Piper for a non-commercial or research project the bar is lower, but the same audit step is worth doing once so you know what you've shipped.
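The audit-then-allowlist step can be sketched as a small filter. Everything here is illustrative: the voice IDs are placeholders, the license strings are whatever you recorded from each model card, and `build_allowlist` is a hypothetical helper, not the code behind AI_TTS_PIPER_VOICES_BY_LANG.

```python
# Licenses we are willing to ship commercially (CC-BY still needs attribution).
COMMERCIAL_OK = {"CC0", "CC-BY 4.0", "MIT", "Apache 2.0"}

# Hand-audited results, one entry per voice. Placeholder IDs and values only;
# None means the license could not be verified from the model card.
AUDITED_LICENSES = {
    "xx_XX-example_a-medium": "CC0",
    "xx_XX-example_b-high": "research-only",
    "yy_YY-example_c-medium": "MIT",
    "yy_YY-example_d-medium": None,
}


def build_allowlist(audited: dict) -> dict:
    """Return {language: [voice_id, ...]} for voices with a verified, permissive license."""
    allowlist = {}
    for voice_id, license_name in sorted(audited.items()):
        if license_name in COMMERCIAL_OK:
            language = voice_id.split("_", 1)[0]
            allowlist.setdefault(language, []).append(voice_id)
    return allowlist


if __name__ == "__main__":
    print(build_allowlist(AUDITED_LICENSES))
```

Unverified voices fall out by default, which matches the rule that anything not confirmed as commercial-permissive stays off the list.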
Why we don't ship every voice
The other reason Quick TTS' Piper allowlist is short: download budget. Each medium-quality Piper voice is roughly 60 MB. If we surfaced every English voice in the catalog (35+ at last count), the user's first interaction with the voice picker would either trigger 2 GB of background downloads or — more realistically — would be one of those frustrating UIs where every voice option has its own loading spinner and you can't tell which one will actually work without clicking and waiting.
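The download-budget arithmetic is simple enough to check directly. This sketch uses the article's rough figures (about 60 MB per medium voice, 35 English voices); the 50 and 100 Mbps link speeds are our own assumption for what a fast connection means.

```python
MEDIUM_VOICE_MB = 60       # rough size of one medium-tier voice
ENGLISH_VOICE_COUNT = 35   # "35+ at last count"


def total_download_gb(voice_count: int, mb_each: float = MEDIUM_VOICE_MB) -> float:
    """Total download size in GB (decimal) if every voice is fetched."""
    return voice_count * mb_each / 1000


def download_seconds(size_mb: float, link_mbps: float) -> float:
    """Ideal transfer time: megabytes to megabits, divided by link speed."""
    return size_mb * 8 / link_mbps


if __name__ == "__main__":
    print(f"all English voices: {total_download_gb(ENGLISH_VOICE_COUNT):.1f} GB")
    print(f"three-voice allowlist: {total_download_gb(3):.2f} GB")
    for mbps in (50, 100):  # assumed link speeds
        print(f"one voice at {mbps} Mbps: {download_seconds(MEDIUM_VOICE_MB, mbps):.1f} s")
```

Under these assumptions the full English catalog comes to about 2.1 GB, a three-voice allowlist to about 0.18 GB, and a single medium voice takes roughly 5 to 10 seconds.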
Three voices per language is the sweet spot we landed on after a few iterations: enough variety that a user who doesn't like the default has a real alternative, few enough that the catalog feels curated rather than dumped. The cost of being more opinionated is that some genuinely good voices (looking at you, en_US-bryce-medium, en_US-kristin-medium) don't make the cut. If you have a strong feeling about one of them, the FAQ has a contact link and we read every request.
Quick TTS' actual allowlist, with reasoning
For full transparency, here's what config.js ships per locale and the one-line reason each voice is on the list. The allowlist comes from AI_TTS_PIPER_VOICES_BY_LANG:
| Locale | Voice IDs shipped | Why |
|---|---|---|
| en | en_US-libritts_r-medium, en_US-joe-medium, en_GB-vctk-medium | Audiobook default + male alternative + British multi-speaker. Three voices covers the long tail without bloating the picker. |
| de | de_DE-thorsten-medium | Best single-speaker open German neural voice in any catalog. We considered shipping the high tier but kept the medium for download-size reasons. |
| es | es_ES-davefx-medium, es_MX-claude-high | Iberian + Mexican coverage. Argentine Spanish (daniela-high) is arguably better but adds a third voice; we left it out and may revisit. |
| fr | fr_FR-siwis-medium | Most consistent French voice. fr_FR-tom-medium is warmer but the SIWIS prosody is more reliable for long-form content. |
| it | it_IT-paola-medium | The only shippable Italian voice in the catalog at this tier. Piper's Italian shelf is genuinely thin. |
| nl | nl_NL-mls-medium | Cleanest Netherlands Dutch. We deliberated on adding a Flemish (nl_BE) voice and decided one voice was enough for v1. |
| vi | vi_VN-vais1000-medium | Best Vietnamese option, period. Piper has only three Vietnamese voices total; this is the only one above x_low. |
The full table covers 14 locales. For Japanese and Korean, config.js intentionally has no Piper entry — the Piper community has not produced viable voices for either language, and Quick TTS routes those locales straight to Kokoro (Japanese has 5 Kokoro voices) or Web Speech (Korean has 8 Edge Online voices). Honest caveat: Korean is the language Quick TTS covers least well, and we don't try to hide it.
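The routing described above can be sketched as a lookup with a fallback. The allowlist entries mirror the rows shown in this article; the function name, the engine labels, and the choice to send unlisted locales to Web Speech are assumptions for illustration, not the actual config.js logic.

```python
# Piper allowlist rows as shown in the table above (the full config has more).
PIPER_VOICES_BY_LANG = {
    "en": ["en_US-libritts_r-medium", "en_US-joe-medium", "en_GB-vctk-medium"],
    "de": ["de_DE-thorsten-medium"],
    "es": ["es_ES-davefx-medium", "es_MX-claude-high"],
    "fr": ["fr_FR-siwis-medium"],
    "it": ["it_IT-paola-medium"],
    "nl": ["nl_NL-mls-medium"],
    "vi": ["vi_VN-vais1000-medium"],
}

# Locales with no Piper entry and the engine they are routed to instead.
NON_PIPER_ENGINE = {"ja": "kokoro", "ko": "webspeech"}


def pick_engine(locale: str):
    """Return (engine, default_voice_id or None) for a locale such as 'en-GB'."""
    language = locale.replace("-", "_").split("_")[0].lower()
    if language in PIPER_VOICES_BY_LANG:
        return "piper", PIPER_VOICES_BY_LANG[language][0]
    # ASSUMPTION: anything not listed falls back to the browser's Web Speech API.
    return NON_PIPER_ENGINE.get(language, "webspeech"), None


if __name__ == "__main__":
    for locale in ("en-GB", "ja_JP", "ko", "vi"):
        print(locale, pick_engine(locale))
```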
Try them yourself
Open Quick TTS, switch the engine selector to Piper, and audition the shipped voices on whatever paragraph you're trying to listen to. The first time you select a voice it downloads the model (typically 5–10 seconds on a fast connection); every subsequent use is instant from browser cache.
If you want the broader picture on how Piper compares to the other browser TTS options, our deep-dive on Web Speech API vs Piper vs Kokoro has the side-by-side. The best-free-TTS-voices post is the higher-level "which voices should I bother with at all" version of this same exercise. The FAQ covers integration questions; the project home is at Quick TTS.