Web Speech API vs Piper vs Kokoro: Browser TTS Compared
Three engines can speak text in a browser today without a backend: the built-in Web Speech API, Piper compiled to WebAssembly, and Kokoro running on WebGPU. We ship all three in Quick TTS and let users hot-swap between them mid-sentence. Here's what we learned about when to reach for which.
The three engines, in one paragraph each
Web Speech API is the browser's built-in SpeechSynthesisUtterance. Zero bytes to download, works on every device that has a browser, and the audio comes out of whichever voices the operating system already has installed. Quality ranges from "robotic GPS" on a stock Windows install to "pretty good" on macOS or recent Edge.
Piper is a neural TTS toolkit from the Rhasspy project, originally for Raspberry Pi voice assistants. The browser path uses vits-web, an ONNX/WASM build of the Piper VITS models. Voices are roughly 30–60 MB each. CPU-only inference, but a single sentence renders faster than real time on any modern laptop.
Kokoro-82M is an 82M-parameter StyleTTS2-derived model that punches well above its weight. The browser build uses kokoro-js on WebGPU through Transformers.js. The fp32 model is roughly 80 MB. On a desktop GPU it's the closest thing to a real audiobook narrator you can get without a paid API.
Web Speech API: the right default for "make it speak now"
Web Speech is the only engine with no install step. Call speechSynthesis.speak(new SpeechSynthesisUtterance(text)) and you're done. It's the right pick for accessibility helpers, form-error readbacks, and anything where time-to-first-audio matters more than naturalness.
It also has three traps that bite every shipping project:
- The Chrome-on-Windows 15-second pause bug. Long utterances stop dead at ~15 s. The fix is a 6-second interval that calls pause() then resume() while speech is active. Ugly, but it works. Quick TTS' _startBumpCheck in speech.js does exactly this.
- The ~32 KB silent drop. Chrome quietly drops utterances over roughly 32 KB. You have to chunk. Quick TTS targets ~300 characters per utterance — small enough that the next-segment latency stays tight, large enough that natural sentence boundaries fit.
- You cannot capture the audio. System-voice output is read-only by spec. There is no MediaStream, no captureStream(), no offline render. If your user wants an MP3, Web Speech can't give it to them.
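The chunking that the 32 KB limit forces can be sketched in a few lines. This is an illustrative helper, not Quick TTS' actual splitter; only the ~300-character target comes from the text above. It prefers sentence boundaries and hard-splits only when a single sentence exceeds the budget:

```javascript
// Split text into chunks of at most `maxLen` characters, preferring to
// break at sentence boundaries so each utterance ends naturally.
// Hypothetical helper — not Quick TTS' real implementation.
function chunkForSpeech(text, maxLen = 300) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    // Flush the running chunk if adding this sentence would overflow it.
    if (current && (current + s).length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    // A single sentence longer than maxLen gets hard-split.
    let rest = s;
    while (rest.length > maxLen) {
      chunks.push(rest.slice(0, maxLen).trim());
      rest = rest.slice(maxLen);
    }
    current += rest;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then becomes its own SpeechSynthesisUtterance, queued back-to-back, so no single utterance ever approaches the silent-drop ceiling.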
Audio quality is bounded by the OS. macOS and iOS ship Apple's premium voices. Windows ships Microsoft's neural voices via Edge but plain SAPI elsewhere. Android voices are controlled at OS level — you can't even enumerate them reliably from the page, which is why Quick TTS replaces the voice picker with a deep-link to system settings on Android.
Piper (WASM): offline-after-first-load, runs anywhere
Piper is the universal middle ground. WebAssembly is everywhere, so a Piper voice, once downloaded, works from then on with no network — no GPU needed, no proprietary runtime, no API key. The voices we ship in Quick TTS (en_US-libritts_r-medium, en_GB-vctk-medium, and friends) sound roughly like a 2021-era Google Assistant: clearly synthetic but not painful.
What you trade for that universality:
- Cold start. First inference includes downloading the voice model (30–60 MB) plus the onnxruntime-web WASM (about 10 MB). On a fresh visit that's a 5–10 second delay before audio. Subsequent visits are instant — the browser cache holds it.
- Main-thread risk. CPU inference of a VITS model can stall the UI for hundreds of milliseconds per chunk if you run it inline. Quick TTS spins up a dedicated piperWorker.js Web Worker, which keeps the main thread responsive but adds a serialization hop for every generated audio blob.
- Voice licensing. Piper's catalog mixes permissive (CC-BY 4.0, MIT, Apache, CC0) and non-commercial voices. If you're embedding Piper, allowlist explicitly. Quick TTS keeps a per-locale allowlist in config.js rather than enumerating whatever vits-web reports.
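The allowlist pattern is small enough to sketch. The shape below is illustrative — it is not vits-web's API or Quick TTS' config.js, and the voice IDs are just the two named above; verify each voice's license yourself before shipping it:

```javascript
// Per-locale allowlist of voice IDs verified as permissively licensed.
// Illustrative data only — check each voice's actual license.
const VOICE_ALLOWLIST = {
  "en-US": ["en_US-libritts_r-medium"],
  "en-GB": ["en_GB-vctk-medium"],
};

// Keep only voices we have explicitly approved for the given locale,
// regardless of what the runtime reports as available.
function allowedVoices(locale, reportedVoiceIds) {
  const allowed = new Set(VOICE_ALLOWLIST[locale] || []);
  return reportedVoiceIds.filter((id) => allowed.has(id));
}
```

The point of filtering the runtime's report rather than trusting it: a toolkit update that adds a non-commercial voice can't silently appear in your picker.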
Kokoro (WebGPU): the highest quality the browser can produce
Kokoro on WebGPU is the most pleasant of the three to listen to. It also has the narrowest reach — desktop Chrome, Edge, and Brave today, with stable Firefox WebGPU rolling out and Safari Tech Preview already shipping it. On iOS and Android, Kokoro sits on the wrong side of a memory cliff: the ~80 MB ONNX file plus the runtime OOMs the tab in our testing, so Quick TTS hides the option on mobile.
The win, when the device can run it, is qualitative. Sentence-level prosody, breath control, even subtle questions-vs-statements intonation are noticeably better than Piper's. It's a model you'd actually choose to listen to a long article through, not just one you'd accept.
The catch is throughput. Generating audio for a paragraph takes 1–3 seconds on a discrete GPU. For a novel-length input that's intolerable if you do it serially. Quick TTS solves it with a pipelined batch loop in aiTts.js:
// While batch N plays, batch N+1 is already being generated.
let nextGenPromise = startGen(_aiBatches[0]);
while (_aiBatchIdx < _aiBatches.length) {
  const blobs = await nextGenPromise; // wait for the batch in flight
  const nextIdx = _aiBatchIdx + 1;
  if (nextIdx < _aiBatches.length) {
    nextGenPromise = startGen(_aiBatches[nextIdx]); // pre-warm next
  }
  await _playBlobsGapless(blobs);
  _aiBatchIdx = nextIdx;
}
Peak GPU memory stays at roughly two batches regardless of input length, and the user hears audio within a couple of seconds of pressing play. Chunk size is tuned larger for Kokoro than for Web Speech (400 characters vs 300 — each engine has a different sweet spot between latency and gap-free audio).
Decision matrix: which one to pick
If you're building your own browser TTS layer, here's the short version:
- Pick Web Speech if your priority is "audio in 50 ms, on any device, with zero install footprint" and you don't care about per-tab voice consistency. Accessibility readbacks, alerts, short notifications.
- Pick Piper if you need consistent voice across devices, must work offline after the first load, can't assume a GPU, and "decent but synthetic" is acceptable. Long-form reading on mid-range hardware, embedded apps, kiosks.
- Pick Kokoro if quality is the deciding factor, your users are on desktop, and you're willing to gracefully degrade for everyone else. Audiobook-style listening, content tools, anything where the voice is part of the product.
- Ship all three if you can't predict your users. Web Speech is the universal floor, Piper covers the mobile-WebGPU gap, Kokoro is the desktop ceiling. The wiring overhead is real but bounded — most of the complexity is in the live-handoff code below.
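The matrix above collapses into a small capability check at startup. A sketch, assuming hypothetical feature flags the host app has already detected (a WebGPU probe via navigator.gpu, a mobile heuristic of your choosing) — the function and flag names are illustrative, not Quick TTS' internals:

```javascript
// Pick the best engine the current device can actually run.
// `caps` is a plain object of detected capabilities; the detection
// itself (WebGPU probe, mobile sniff) is up to the host app.
function pickEngine(caps) {
  if (caps.webgpu && !caps.mobile) return "kokoro"; // desktop ceiling
  if (caps.wasm) return "piper"; // offline-capable middle ground
  return "webspeech"; // universal floor
}
```

Note the mobile guard: per the memory-cliff observation above, reporting WebGPU support is not the same as surviving an 80 MB model, so mobile devices fall through to Piper even when navigator.gpu exists.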
The handoff problem nobody talks about
The interesting bug surface in a multi-engine TTS isn't the engines themselves — it's what happens when the user switches mid-read. Toggle from Browser to Kokoro halfway through chapter four and the naive implementation either restarts from the top or stops dead.
Quick TTS handles this by tracking the last word boundary the active engine reached (onboundary.charIndex for Web Speech, the current batch index for AI engines) and handing the remaining text to the new engine. If Kokoro init fails after the user opted in, _fallbackSpeakText flips the active engine back to null and re-enters speakText() with the same remaining text, so the user hears Browser TTS finish the article. The fallback chain is Kokoro → Browser, Piper → Browser, never Browser → AI (Browser is always available, so it's the floor). Settings changes mid-read use the same remainder-handoff machinery, just without the engine swap.
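The remainder computation itself is the easy part. A sketch of the handoff arithmetic, assuming you track the character offset at which the currently speaking chunk starts (names here are illustrative, not Quick TTS' actual code):

```javascript
// Given the full text, the offset where the currently speaking chunk
// starts, and the charIndex from the last `onboundary` event (relative
// to that chunk), return the text the next engine should pick up.
function remainingText(fullText, chunkStartOffset, boundaryCharIndex) {
  const absolute = chunkStartOffset + boundaryCharIndex;
  // Back up to the start of the current word so no word is half-spoken
  // by one engine and half by the next.
  const lastSpace = fullText.lastIndexOf(" ", absolute);
  return fullText.slice(lastSpace < 0 ? 0 : lastSpace + 1);
}
```

The hard part is everything around this function: stopping the old engine cleanly, surviving an init failure in the new one, and making sure a second switch during the handoff doesn't race the first.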
Where this is going
WebGPU is the variable that resets this whole comparison every six months. Stable Firefox is rolling out desktop WebGPU; Safari Tech Preview ships it; iOS 18 exposes a partial implementation that still OOMs on Kokoro-sized models but probably won't by 2027. When mobile WebGPU is reliable, Kokoro becomes the default for everyone and Piper becomes the legacy fallback for old browsers.
On the model side, expect Kokoro-class quality at half the size — the open-source TTS field is moving roughly as fast as image generation did in 2022–2023. Piper itself is gaining lighter-weight voices that close the quality gap, and there's an active effort to ship multilingual Kokoro (the current kokoro-js is English-only). The strategic bet for new browser TTS work: build for the engine-swap pattern, not for any single engine.
Try it
Open Quick TTS, paste a paragraph, and toggle through Browser, Piper, and Kokoro to hear all three on the same input. The engine selector is the dropdown next to the play controls; switching mid-playback hands off the remainder, so you can hear exactly where each engine takes over.
For the broader use cases — proofreading long writing, reading EPUBs aloud, accessibility tooling — the guide covers nine of them. The FAQ answers most of the integration questions we get from developers. The comparison page stacks Quick TTS against the paid alternatives. And our previous post on free EPUB-to-speech walks through the engine choice from a reader's perspective rather than a builder's.