Kokoro-82M in the Browser: WebGPU Benchmarks Across NVIDIA, AMD, and Apple Silicon
Kokoro-82M is the smallest open TTS model that genuinely sounds like a human reader, and the kokoro-js port runs it directly in the browser via WebGPU. The natural question — how fast is "directly in the browser", actually? — has roughly zero published numbers. So we measured. Here's what an 82M-parameter neural TTS looks like on a handful of consumer GPUs, with all the caveats spelled out up front.
Why a browser benchmark for an 82M-parameter model
Kokoro-82M is small by 2026 standards. Llama-class language models are at least an order of magnitude bigger; even Whisper-tiny (39M) and Whisper-base (74M) are in the same neighborhood. That smallness is the whole reason it ships in a browser: the 8-bit quantized ONNX export is roughly 80 MB (fp32 would be about 330 MB, since 82M parameters × 4 bytes ≈ 328 MB), which is downloadable, cacheable, and fits in the WebGPU memory budget on almost any laptop with a discrete or integrated GPU made in the last five years.
But "fits in memory" is not the same as "runs at usable speed". A model can technically execute on Intel UHD integrated graphics and still take 12 seconds to produce one second of audio — at which point the experience is worse than not having neural TTS at all. We wanted to know where the cliff is. Quick TTS ships Kokoro behind a desktop-only flag, falls back to Piper or Web Speech elsewhere, and the cutoff for "should we even offer it" is exactly the question this post is trying to answer.
What we measured
For each GPU we ran the same simple harness: paste a fixed test paragraph, hit play, and time the relevant events with `performance.now()`. Three input sizes (200, 500, and 1000 characters), each in cold and warm variants:
- Cold load — empty browser cache, model downloaded fresh from `cdn.jsdelivr.net` and `huggingface.co`, WebGPU device acquired, ONNX session initialized. Network is the dominant variable here, so we report it on a 200 Mbit home connection.
- TTFA (time-to-first-audio) — the gap between the play-button click and the first PCM sample reaching the audio output. Measured warm (model already cached and in GPU memory) for the 200- and 1000-character inputs.
- Throughput — total seconds of audio produced divided by total wall-clock seconds spent generating. Numbers above 1.0 mean the model is faster than real-time; below 1.0 means the audio plays out faster than it can be generated and there will be gaps without pipelining.
- Peak RAM — observed in Chrome's task manager during a fresh 1000-character run. WebGPU and JS heap combined; not instrumented, not lab-grade.
All browsers were Chromium-based: Chrome 134 stable on every machine, with a spot-check of Edge 134 on the Windows machines (one run each — Edge's numbers were within ~5% of Chrome's, so the table reports Chrome). All runs used Kokoro's default voice, `af_heart`, and the same model, `onnx-community/Kokoro-82M-ONNX`, identical across machines.
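Concretely, the per-run instrumentation reduces to about a dozen lines. This is a sketch rather than the exact harness; `generateAudio` is a stand-in for the kokoro-js generation call, and only `performance.now()` is a real API here:

```javascript
// Minimal per-run instrumentation. `generateAudio` is a stand-in for the
// kokoro-js generation call; it should resolve with { audioSeconds } once
// the first PCM is ready to hand to the audio output.
async function benchmarkRun(generateAudio, text) {
  const t0 = performance.now(); // t = 0: the play-button click
  const { audioSeconds } = await generateAudio(text);
  const t1 = performance.now(); // first audio ready
  return {
    ttfaMs: t1 - t0,                               // time-to-first-audio
    throughput: audioSeconds / ((t1 - t0) / 1000), // audio s per wall-clock s
  };
}
```

The same wrapper produces both the TTFA and throughput columns, which keeps the two numbers comparable across machines.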
The numbers
Single-day, single-tester, browser task-manager memory rather than instrumented profiling. Treat as orientation, not lab data. Numbers below are approximate to the nearest 50 ms / 50 MB.
| GPU | Cold load (s) | TTFA — 200 ch (ms) | TTFA — 1000 ch (ms) | Throughput (audio s / wall s) | Peak RAM |
|---|---|---|---|---|---|
| RTX 4070 desktop (12 GB, PCIe 4) | ~7 | ~300 | ~1100 | ~6.5× | ~340 MB |
| RTX 3060 laptop (6 GB, mobile) | ~8 | ~500 | ~1900 | ~3.8× | ~360 MB |
| M3 Pro MacBook Pro (18-core GPU, 18 GB unified) | ~7 | ~600 | ~2200 | ~3.2× | ~330 MB |
| M2 MacBook Air (8-core GPU, 16 GB unified) | ~9 | ~750 | ~2900 | ~2.4× | ~340 MB |
| Radeon RX 7600 desktop (8 GB) | ~9 | ~700 | ~2700 | ~2.6× | ~370 MB |
| Intel Arc A380 (6 GB) | ~12 | ~1300 | ~5000 | ~1.3× | ~390 MB |
| Intel UHD 770 integrated (shared system RAM) | ~18 | ~3500 | occasional OOM / kill | ~0.6× when it completes | ~520 MB (often the last reading before crash) |
A few patterns are stable across re-runs even if the absolute numbers wobble. NVIDIA discrete is consistently the fastest tier, with the desktop RTX 4070 producing roughly six seconds of audio per wall-clock second on long inputs. Apple Silicon is closer to mid-range NVIDIA than to high-end NVIDIA — the M3 Pro lands near the RTX 3060 laptop, and the M2 Air is roughly Radeon RX 7600 territory. Intel Arc A380 is usable but tight (you'll feel the latency on the first chunk). Intel UHD integrated is a coin flip — sometimes it works, sometimes Chrome kills the WebGPU process when the model load + the page heap collide with the graphics-shared system RAM allocation.
What changes when input gets long
TTFA scales roughly linearly with input length, which is what you'd expect when text is rendered chunk by chunk at a roughly fixed rate: the 200-character measurement captures one chunk of generation, the 1000-character measurement roughly five. Doubling input length doubles the wait — unless you pipeline.
| Input length | TTFA, naive single-chunk | TTFA, pipelined batches |
|---|---|---|
| 200 chars | ~300 ms | ~300 ms |
| 500 chars | ~600 ms | ~350 ms |
| 1000 chars | ~1100 ms | ~400 ms |
| 5000 chars (one short article) | ~5500 ms | ~450 ms |
The pipelined number stays roughly flat because the user only ever waits for the first batch to finish. Subsequent batches generate in the background while playback is in progress. Beyond ~500 characters the pipeline is fully fed, and TTFA is bounded by the time to render the first ~400 characters regardless of whether the input is one paragraph or one chapter.
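That flat pipelined TTFA depends on capping each batch at roughly 400 characters. Quick TTS's actual splitter isn't shown here, but a plausible version, splitting at sentence boundaries under a character cap, looks like this (the function name and the splitting policy are our assumptions):

```javascript
// Split text into batches of at most `maxChars`, breaking at sentence
// boundaries so chunks don't end mid-word. A single sentence longer than
// maxChars becomes its own oversized batch rather than being cut.
function splitIntoBatches(text, maxChars = 400) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const batches = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      batches.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) batches.push(current.trim());
  return batches;
}
```

With a cap like this, the first batch is the only one the user ever waits on, which is exactly why the pipelined column stops growing past ~500 characters.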
When WebGPU init fails
Three failure modes account for almost every "Kokoro didn't start" report we see:
- Browser doesn't expose WebGPU. Firefox stable shipped WebGPU on desktop in 2025 (Windows first), but kokoro-js still requires a few extensions Firefox hasn't enabled by default; iOS Safari doesn't expose it stably either. The kokoro-js promise rejects at `navigator.gpu.requestAdapter()`. Quick TTS catches that, flips `_aiReady` to false, and re-enters Web Speech for the remainder of the text.
- Adapter exists but allocation fails. Common on integrated graphics where the driver advertises WebGPU support but won't grant the ~250 MB of GPU memory the model needs. Symptom: the adapter resolves, the device acquires, and then the first inference call throws `OOMError`. Same fallback path.
- iOS / Android memory cliff. Even where WebGPU is exposed, mobile Safari and mobile Chrome aggressively reclaim tab memory once a single tab passes ~250 MB of resident heap. Loading the Kokoro ONNX file plus the runtime can trip that limit silently — the user sees a tab reload, no error in the console. We hide the Kokoro toggle on mobile entirely rather than offer a feature that fails 30% of the time.
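The three failure modes above collapse into one decision function. This is an illustrative sketch, not Quick TTS's actual code; the navigator object and the mobile flag are injected so the logic can run (and be tested) outside a browser:

```javascript
// Decide which engine to offer, mirroring the fallback order described
// above. `nav` stands in for the browser's navigator; "webspeech" stands in
// for the Piper / Web Speech fallback tier. Names here are illustrative.
async function pickTtsEngine(nav, isMobile) {
  if (isMobile) return "webspeech";   // mobile memory cliff: never offer Kokoro
  if (!nav.gpu) return "webspeech";   // browser doesn't expose WebGPU at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) return "webspeech"; // WebGPU exposed but no usable adapter
    return "kokoro";                  // allocation failures surface later, at
                                      // first inference, and fall back then
  } catch {
    return "webspeech";               // requestAdapter() rejected outright
  }
}
```

Note that the second failure mode (adapter exists, allocation fails) can't be detected up front — it only shows up at first inference, which is why the fallback path has to stay armed after engine selection.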
How Quick TTS keeps peak memory bounded regardless of input length
The pipelined batch loop in `aiTts.js` is what makes pasting a novel-length input feel responsive on a mid-range GPU. Pseudo-code, lifted from the actual implementation:

```js
// While batch N plays, batch N+1 is already being generated.
// _aiBatchIdx starts at 0; _aiBatches holds the pre-split text chunks.
let nextGenPromise = startGen(_aiBatches[0]);
while (_aiBatchIdx < _aiBatches.length) {
  const blobs = await nextGenPromise;      // wait for the current batch
  const nextIdx = _aiBatchIdx + 1;
  if (nextIdx < _aiBatches.length) {
    nextGenPromise = startGen(_aiBatches[nextIdx]); // pre-warm the next one
  }
  await _playBlobsGapless(blobs);
  _aiBatchIdx = nextIdx;
}
```
Two things fall out of this. First, the user hears audio after one batch's worth of generation, no matter how long the total input is — the first batch is roughly 400 characters and renders in well under a second on any GPU at or above an RTX 3060 laptop tier. Second, peak memory stays bounded at "two batches' worth of audio blobs plus the model weights", because once a batch finishes playing it's dereferenced and the GC reclaims it before the next one is scheduled. We've fed a full 50,000-character chapter into Quick TTS on an RTX 3060 laptop and watched Chrome's task manager hold steady at around 360 MB the entire time.
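The "two batches in flight" bound is easy to check with a toy version of the loop that simulates generation and playback and counts generated-but-unplayed batches (all names here are ours, not from aiTts.js):

```javascript
// Toy model of the pipelined loop: tracks how many generated-but-unplayed
// batches exist at any moment. The pre-warm pattern holds at most two.
async function simulatePipeline(batchCount) {
  let alive = 0;
  let peakAlive = 0;
  const startGen = async () => {            // simulated generation
    alive++;
    peakAlive = Math.max(peakAlive, alive);
    return {};                              // stand-in for the audio blobs
  };
  const play = async (blobs) => { alive--; }; // simulated gapless playback

  let next = startGen();                    // same shape as the real loop
  for (let i = 0; i < batchCount; i++) {
    const blobs = await next;               // wait for the current batch
    if (i + 1 < batchCount) next = startGen(); // pre-warm the next one
    await play(blobs);                      // blob dereferenced after playback
  }
  return peakAlive;
}
```

Regardless of `batchCount`, the peak never exceeds two, which is the invariant behind the flat ~360 MB task-manager reading on the 50,000-character chapter.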
Reproducing these numbers
Three tools are enough to reproduce most of the table on your own hardware:
- `chrome://gpu` tells you whether WebGPU is enabled and which adapter Chrome picked. If "WebGPU: Hardware accelerated" isn't green, no benchmark you run will be representative.
- `performance.now()` instrumentation around the kokoro-js `generate()` call and around the first `audioContext.decodeAudioData()` success. The TTFA number above is the gap between those two timestamps, with the play-button click as t=0.
- Chrome DevTools → Memory → Take heap snapshot at three points (before play, after first batch, after full input completes) gives you peak JS heap. WebGPU memory itself isn't visible from DevTools — Chrome's task manager is the only practical reading, and it lumps GPU and CPU memory together.
If your numbers diverge meaningfully from ours we'd like to know — especially on the long tail of integrated GPUs and on AMD APUs we didn't have access to. Network speed dominates cold-load timings, so report your throughput / TTFA on a warm cache for fair comparison.
Try it
Open Quick TTS in a desktop browser, switch the engine selector to Kokoro, paste any English text, and hit play. The first audio should land within roughly the TTFA listed for your GPU tier above. If you have a slower or unusual GPU, the fallback hands the rest of the text to Piper or Web Speech with no replay or restart — the same handoff machinery covers the engine-not-supported case.
For the bigger picture on when each engine is the right pick, our deep-dive on Web Speech API vs Piper vs Kokoro is the longer read. The FAQ answers most "why doesn't Kokoro show up on my Mac mini" support questions, and the guide walks through the broader use cases that make a per-GPU benchmark relevant in the first place. The comparison page stacks Quick TTS against the paid alternatives if you'd rather hand the GPU work to someone else's servers.