
Kokoro-82M in the Browser: WebGPU Benchmarks Across NVIDIA, AMD, and Apple Silicon

Kokoro-82M is the smallest open TTS model that genuinely sounds like a human reader, and the kokoro-js port runs it directly in the browser via WebGPU. The natural question — how fast is "directly in the browser", actually? — has roughly zero published numbers. So we measured. Here's what an 82M-parameter neural TTS looks like on a handful of consumer GPUs, with all the caveats spelled out up front.

Why a browser benchmark for an 82M-parameter model

Kokoro-82M is small by 2026 standards. Llama-class language models are at least an order of magnitude bigger; even Whisper-tiny (39M) and Whisper-base (74M) are in the same neighborhood. That smallness is the whole reason it ships in a browser: the 8-bit quantized ONNX export is roughly 80 MB (82M parameters at about one byte each), which is downloadable, cacheable, and fits in the WebGPU memory budget on almost any laptop with a discrete or integrated GPU made in the last five years.

But "fits in memory" is not the same as "runs at usable speed". A model can technically execute on Intel UHD integrated graphics and still take 12 seconds to produce one second of audio — at which point the experience is worse than not having neural TTS at all. We wanted to know where the cliff is. Quick TTS ships Kokoro behind a desktop-only flag, falls back to Piper or Web Speech elsewhere, and the cutoff for "should we even offer it" is exactly the question this post is trying to answer.

What we measured

For each GPU we ran the same simple harness: paste a fixed test paragraph, hit play, and time the relevant events with performance.now(). We used three input sizes (200, 500, and 1000 characters), each in a cold variant (model downloaded fresh) and a warm variant (model already in the browser cache), and recorded cold load time, time-to-first-audio (TTFA), throughput, and peak memory.

All runs were Chromium-based: Chrome 134 stable on the Windows desktop and laptop machines and on the M2 and M3 Macs, plus Edge 134 as a cross-check on the Windows machines (one run each; Edge's numbers were within ~5% of Chrome's, so the table reports Chrome). All runs used Kokoro's default af_heart voice and the same model URL, onnx-community/Kokoro-82M-ONNX, across machines.
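
Concretely, the harness is little more than timers around kokoro-js calls. A minimal sketch, assuming kokoro-js's from_pretrained / generate API and a paragraph200 constant standing in for the fixed 200-character test text:

import { KokoroTTS } from "kokoro-js";

// Cold load: first visit, nothing in the HTTP cache yet.
const t0 = performance.now();
const tts = await KokoroTTS.from_pretrained("onnx-community/Kokoro-82M-ONNX", {
    dtype: "q8",        // the ~80 MB quantized export
    device: "webgpu",
});
console.log(`cold load: ${((performance.now() - t0) / 1000).toFixed(1)} s`);

// Warm TTFA: time from request to the first playable audio object.
const t1 = performance.now();
const audio = await tts.generate(paragraph200, { voice: "af_heart" });
console.log(`TTFA (200 ch): ${(performance.now() - t1).toFixed(0)} ms`);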

The numbers

Single-day, single-tester, browser task-manager memory rather than instrumented profiling. Treat as orientation, not lab data. Numbers below are approximate to the nearest 50 ms / 50 MB.

Kokoro-82M (af_heart) — cold load, warm time-to-first-audio, and throughput across consumer GPUs. Chrome 134, kokoro-js 1.2.1, May 2026. Approximate.
GPU | Cold load (s) | TTFA, 200 ch (ms) | TTFA, 1000 ch (ms) | Throughput (audio s / wall s) | Peak RAM
RTX 4070 desktop (12 GB, PCIe 4) | ~7 | ~300 | ~1100 | ~6.5× | ~340 MB
RTX 3060 laptop (6 GB, mobile) | ~8 | ~500 | ~1900 | ~3.8× | ~360 MB
M3 Pro MacBook Pro (18-core GPU, 18 GB unified) | ~7 | ~600 | ~2200 | ~3.2× | ~330 MB
M2 MacBook Air (8-core GPU, 16 GB unified) | ~9 | ~750 | ~2900 | ~2.4× | ~340 MB
Radeon RX 7600 desktop (8 GB) | ~9 | ~700 | ~2700 | ~2.6× | ~370 MB
Intel Arc A380 (6 GB) | ~12 | ~1300 | ~5000 | ~1.3× | ~390 MB
Intel UHD 770 integrated (shared system RAM) | ~18 | ~3500 | occasional OOM / kill | ~0.6× when it completes | ~520 MB (often the last reading before crash)

A few patterns are stable across re-runs even if the absolute numbers wobble. NVIDIA discrete is consistently the fastest tier, with the desktop RTX 4070 producing roughly six and a half seconds of audio per wall-clock second on long inputs. Apple Silicon is closer to mid-range NVIDIA than to high-end NVIDIA — the M3 Pro lands near the RTX 3060 laptop, and the M2 Air is roughly Radeon RX 7600 territory. Intel Arc A380 is usable but tight (you'll feel the latency on the first chunk). Intel UHD integrated is a coin flip — sometimes it works, sometimes Chrome kills the WebGPU process when the model weights plus the page heap overflow the slice of system RAM shared with the integrated GPU.

What changes when input gets long

TTFA scales roughly linearly with input length, which is what you'd expect when generation cost is proportional to the duration of the audio being rendered. The 200-character measurement captures one chunk of generation; the 1000-character measurement captures roughly five. Doubling input length doubles the wait — unless you pipeline.

TTFA scaling on RTX 4070 desktop, single-chunk vs. pipelined batch loop. Approximate, Chrome 134.
Input length | TTFA, naive single-chunk | TTFA, pipelined batches
200 chars | ~300 ms | ~300 ms
500 chars | ~600 ms | ~350 ms
1000 chars | ~1100 ms | ~400 ms
5000 chars (one short article) | ~5500 ms | ~450 ms

The pipelined number stays roughly flat because the user only ever waits for the first batch to finish. Subsequent batches generate in the background while playback is in progress. Beyond ~500 characters the pipeline is fully fed, and TTFA is bounded by the time to render the first ~400 characters regardless of whether the input is one paragraph or one chapter.
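
That flat TTFA depends on the input being pre-split into batches of roughly 400 characters. One way to write such a splitter, sketched here for illustration (the actual batching logic in aiTts.js may differ):

// Split text into ~400-character batches, breaking at sentence boundaries
// so no batch starts or ends mid-sentence.
function splitIntoBatches(text, target = 400) {
    const batches = [];
    let current = "";
    for (const sentence of text.split(/(?<=[.!?])\s+/)) {
        if (current && current.length + sentence.length > target) {
            batches.push(current);
            current = sentence;
        } else {
            current = current ? current + " " + sentence : sentence;
        }
    }
    if (current) batches.push(current);
    return batches;
}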

When WebGPU init fails

Three failure modes account for almost every "Kokoro didn't start" report we see. First, navigator.gpu is missing entirely: the browser or platform has no WebGPU, or it's behind a flag. Second, requestAdapter() resolves to null: the GPU is on the browser's blocklist, or the session is remote or virtualized. Third, the adapter initializes but the device is lost partway through model load, which is the integrated-GPU memory crunch described above.
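
The first two modes can be probed before a single model byte is downloaded. A sketch of that check, not the exact code Quick TTS ships:

async function webgpuAvailable() {
    if (!("gpu" in navigator)) return false;          // mode 1: WebGPU API absent
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) return false;                       // mode 2: no usable adapter
    try {
        const device = await adapter.requestDevice();
        device.destroy();                             // probe only; real init happens later
        return true;
    } catch {
        return false;                                 // device creation refused outright
    }
}

Mode 3 can't be probed up front: the GPUDevice.lost promise resolves when the browser reclaims the device mid-run, and that's the natural point to hand off to Piper or Web Speech.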

How Quick TTS keeps peak memory bounded regardless of input length

The pipelined batch loop in aiTts.js is what makes pasting a novel-length input feel responsive on a mid-range GPU. Pseudo-code, lifted from the actual implementation:

// While batch N plays, batch N+1 is already being generated.
let _aiBatchIdx = 0;
let nextGenPromise = startGen(_aiBatches[0]);       // kick off batch 0 immediately
while (_aiBatchIdx < _aiBatches.length) {
    const blobs = await nextGenPromise;             // wait for the in-flight batch
    const nextIdx = _aiBatchIdx + 1;
    if (nextIdx < _aiBatches.length) {
        nextGenPromise = startGen(_aiBatches[nextIdx]); // pre-warm the next batch
    }
    await _playBlobsGapless(blobs);                 // play while the next one renders
    _aiBatchIdx = nextIdx;                          // advance; loop exits after the last batch
}

Two things fall out of this. First, the user hears audio after one batch's worth of generation, no matter how long the total input is — the first batch is roughly 400 characters and renders in well under a second on any GPU at or above the RTX 3060 laptop tier. Second, peak memory stays bounded at "two batches' worth of audio blobs plus the model weights", because once a batch finishes playing it's dereferenced and becomes collectable before the next one is scheduled. We've fed a full 50,000-character chapter into Quick TTS on an RTX 3060 laptop and watched Chrome's task manager hold steady at around 360 MB the entire time.
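
One plausible shape for _playBlobsGapless, sketched as an assumption rather than lifted from the implementation: decode each blob up front, then schedule the buffers back-to-back on a shared AudioContext clock so chunk boundaries are inaudible.

const _ctx = new AudioContext();

async function _playBlobsGapless(blobs) {
    // Decode everything first so scheduling isn't racing playback.
    const buffers = await Promise.all(
        blobs.map(async (b) => _ctx.decodeAudioData(await b.arrayBuffer()))
    );
    let startAt = _ctx.currentTime + 0.05;  // small headroom for scheduling
    let last;
    for (const buf of buffers) {
        const src = _ctx.createBufferSource();
        src.buffer = buf;
        src.connect(_ctx.destination);
        src.start(startAt);                 // begins exactly where the previous buffer ends
        startAt += buf.duration;
        last = src;
    }
    await new Promise((resolve) => { last.onended = resolve; });
}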

Reproducing these numbers

Three tools are enough to reproduce most of the table on your own hardware: the kokoro-js package (we used 1.2.1) pointed at onnx-community/Kokoro-82M-ONNX, a performance.now() harness around the load and generate calls, and the browser's task manager for peak memory.

If your numbers diverge meaningfully from ours we'd like to know — especially on the long tail of integrated GPUs and on AMD APUs we didn't have access to. Network speed dominates cold-load timings, so report your throughput / TTFA on a warm cache for fair comparison.
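
Warm throughput is one more timer on top of the harness above. This sketch assumes tts is the loaded instance, text1000 is a fixed 1000-character input, and that generate returns a transformers.js-style RawAudio with an audio sample array and a sampling_rate field:

// Throughput = seconds of audio produced per wall-clock second.
const t2 = performance.now();
const out = await tts.generate(text1000, { voice: "af_heart" });
const wallSec = (performance.now() - t2) / 1000;
const audioSec = out.audio.length / out.sampling_rate;  // samples / rate (assumed fields)
console.log(`warm throughput: ${(audioSec / wallSec).toFixed(1)}x real time`);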

Try it

Open Quick TTS in a desktop browser, switch the engine selector to Kokoro, paste any English text, and watch the play button. The first audio should land within roughly the TTFA listed for your GPU tier above. If your GPU is slower or unusual, the engine toggle hands the rest of the text to Piper or Web Speech with no replay or restart — the same handoff machinery covers the engine-not-supported case.

For the bigger picture on when each engine is the right pick, our deep-dive on Web Speech API vs Piper vs Kokoro is the longer read. The FAQ answers most "why doesn't Kokoro show up on my Mac mini" support questions, and the guide walks through the broader use cases that make a per-GPU benchmark relevant in the first place. The comparison page stacks Quick TTS against the paid alternatives if you'd rather hand the GPU work to someone else's servers.