Why Piper-WASM Still Doesn't Work Reliably on iOS (and When It Might)

Piper compiled to WebAssembly should, in principle, run anywhere a browser does. iOS Safari supports WebAssembly. Therefore Piper should run on iPhone. We tried this in Quick TTS, watched it fail in three different ways, and ended up gating the AI engine off on mobile entirely. Here's what actually breaks, what we're doing about it today, and what would have to change at the platform level for it to be a real option.

What you'd expect

The story sounds plausible on paper. Piper exports its VITS models as ONNX. onnxruntime-web compiles to WASM. iOS Safari has shipped WebAssembly since iOS 11, has supported SIMD since iOS 16.4, and runs reasonably standard HTML5 audio. A 60 MB voice model and a 10 MB runtime are not a large download by 2026 standards. Stick the inference loop in a Web Worker, pipe the resulting PCM into an AudioContext, ship.
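In sketch form, that naive pipeline is only a couple dozen lines. Here it is as a hedged sketch rather than production code: the model path is a placeholder, the tensor names ('input', 'input_lengths', 'scales', 'output') follow Piper's VITS export but should be treated as assumptions, and the phoneme IDs would come from Piper's phonemizer.

```js
// worker.js — the "should just work" version: Piper's ONNX export running
// under onnxruntime-web, off the main thread.
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('/models/en_US-voice.onnx', {
  executionProviders: ['wasm'],
});

self.onmessage = async ({ data: phonemeIds }) => {
  const ids = BigInt64Array.from(phonemeIds, BigInt);
  const feeds = {
    input: new ort.Tensor('int64', ids, [1, ids.length]),
    input_lengths: new ort.Tensor('int64', BigInt64Array.of(BigInt(ids.length)), [1]),
    scales: new ort.Tensor('float32', Float32Array.of(0.667, 1.0, 0.8), [3]),
  };
  const { output } = await session.run(feeds);
  // Float32 PCM goes back to the main thread, where an AudioContext plays it.
  self.postMessage(output.data, [output.data.buffer]);
};
```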

That story is wrong in three specific places, and the failure modes compound. Each one would be solvable in isolation; together they make Piper-on-iOS the kind of feature that works for the developer testing it on Wi-Fi at full battery and fails for half the users who try it in the real world. Quick TTS shipped Piper on iOS for exactly one development cycle before pulling it back behind a PLATFORM.isMobile gate.

What actually happens, in three failure modes

1. The AudioContext autoplay restriction

iOS Safari's strictest audio rule is older than WebAssembly itself: AudioContext output is suspended by default, and the only way to resume it is from a synchronous handler running inside a user-gesture event (touchend, click). The rule applies regardless of where the audio data came from — even PCM you generated in a Web Worker and passed back to the main thread will not play unless the AudioContext was already unlocked at some prior moment of user interaction.

The trap: synthesizing audio successfully and then watching it never play. The console is silent. The worker reports a finished generation. The blob exists. Nothing comes out of the speaker. We hit this on the very first iOS test pass; the symptom looks identical to a broken model load, which is exactly the wrong place to spend an afternoon debugging.

Quick TTS' workaround lives in app.js as _unlockMobileAudio: every play-button click fires a silent oscillator and an empty SpeechSynthesisUtterance first, both of which count as "user-initiated audio output" and quietly unlock the AudioContext for the rest of the session. That fix works for Web Speech (which we ship), and it would work for Piper if the other two failure modes weren't waiting downstream.
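The shape of that trick, as a simplified sketch (not the production _unlockMobileAudio; playBtn, audioCtx, and startPlayback are stand-ins):

```js
let unlocked = false;

function unlockMobileAudio(audioCtx) {
  if (unlocked) return;
  // A zero-gain oscillator counts as user-initiated audio output and
  // lifts Safari's suspension on the AudioContext.
  const osc = audioCtx.createOscillator();
  const gain = audioCtx.createGain();
  gain.gain.value = 0; // silent
  osc.connect(gain);
  gain.connect(audioCtx.destination);
  osc.start();
  osc.stop(audioCtx.currentTime + 0.01);
  audioCtx.resume();
  // An empty utterance does the same for speechSynthesis.
  speechSynthesis.speak(new SpeechSynthesisUtterance(''));
  unlocked = true;
}

// Must run synchronously inside the gesture handler, before any await.
playBtn.addEventListener('click', () => {
  unlockMobileAudio(audioCtx);
  startPlayback(); // async work can follow
});
```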

2. The per-tab memory cliff

iOS Safari has a per-tab memory limit that is much more aggressive than desktop browsers'. The exact number is undocumented and varies by device — historically WebKit has kept WebAssembly heap allocations under roughly 256 MB on a 4 GB iPhone, with stricter limits on lower-RAM devices (and 4 GB is a common tier: the iPhone 11, the iPhone 12 mini, and several iPad Air SKUs all ship with it). The WebKit team has not published a stable budget; it's a moving floor that drops further when other tabs or apps are competing for memory.

Loading a 60 MB Piper voice + the onnxruntime-web WASM heap (~80 MB resident during inference) + the page's own JS heap consistently lands in the danger zone. iOS doesn't return an error when a tab crosses the limit; it silently kills and reloads the tab. The user sees a fresh page, no console error, no event the page can listen for. Sometimes the kill happens during model load, sometimes during the first inference, sometimes minutes later when the user comes back to the tab.

We logged this with anonymized analytics and found the kill rate on iOS Safari for the brief window we shipped Piper there was somewhere in the high-twenties percent. On Android Chrome it was lower (around 8%) but still well above any reasonable bar. There's no graceful degradation path for "your tab might get killed at any moment" — the only fix is not to allocate that much memory to begin with.
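Measuring the kill rate after the fact is at least possible. Here is a sketch of the idea, not our exact analytics code (the flag name and the reporting hook are made up): write a sessionStorage flag before the risky allocation, clear it on success, and check for a stale flag on the next load, since sessionStorage survives the reload Safari performs after a kill.

```js
const KILL_FLAG = 'piper-load-in-flight';

// On page load: a leftover flag means the previous page instance died
// between "load started" and "load finished" — on iOS, almost always
// the memory reaper.
if (sessionStorage.getItem(KILL_FLAG) === '1') {
  sessionStorage.removeItem(KILL_FLAG);
  reportAnonymizedEvent('probable-tab-kill'); // hypothetical analytics hook
}

async function loadPiperModel(url) {
  sessionStorage.setItem(KILL_FLAG, '1');
  try {
    return await fetchAndInitModel(url); // hypothetical: the risky allocation
  } finally {
    sessionStorage.removeItem(KILL_FLAG); // only reached if we survived
  }
}
```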

3. The single-threaded WASM tax

Desktop ONNX inference for VITS-class models leans heavily on multithreading; onnxruntime-web spawns a thread pool sized to the device's logical core count and runs the matrix operations in parallel. WASM threads require SharedArrayBuffer, which in turn requires the page to be served with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers (the COOP/COEP "cross-origin isolation" combo).
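The wiring looks like this; ort.env.wasm.numThreads is real onnxruntime-web configuration, while modelUrl and the cap of four threads are our placeholders:

```js
// The two response headers that gate SharedArrayBuffer:
//   Cross-Origin-Opener-Policy: same-origin
//   Cross-Origin-Embedder-Policy: require-corp
import * as ort from 'onnxruntime-web';

// Only request a thread pool when the page is actually isolated.
const isolated =
  typeof SharedArrayBuffer !== 'undefined' && self.crossOriginIsolated === true;

// numThreads must be set before the first InferenceSession is created.
ort.env.wasm.numThreads = isolated
  ? Math.min(navigator.hardwareConcurrency || 1, 4)
  : 1;

const session = await ort.InferenceSession.create(modelUrl, {
  executionProviders: ['wasm'],
});
```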

iOS Safari technically supports SharedArrayBuffer when those headers are set, but the support is fragile. Several iOS releases over the last three years have shipped COOP/COEP regressions in which the headers parse and the page reports cross-origin-isolated by every test we can run, yet WASM threads inside a Worker still execute serially even though navigator.hardwareConcurrency reports multiple cores. We've never been able to consistently reproduce a working multi-threaded inference pass on iOS, and the onnxruntime-web project's own issue tracker is dotted with reports matching that pattern.

The practical effect: a Piper inference call that takes 200 ms on desktop Chrome runs at 800 ms or longer on a recent iPhone. That's bad enough on its own; worse, single-threaded WASM ties up the worker's only thread for the full duration of each inference, so the worker can't deliver the next audio chunk while it's computing and the AudioContext on the main thread can run dry. The audio audibly stutters in a way it doesn't on any other platform, and iOS Safari's relatively small audio buffers make this worse, not better.
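The standard mitigation is to schedule ahead: synthesize chunk N+1 while chunk N plays, queuing buffers against the AudioContext clock so a slow inference eats lookahead instead of causing an underrun. A sketch (mono PCM, with the sample rate passed in by the caller):

```js
// Queue PCM chunks back-to-back on the AudioContext clock so playback
// survives a slow worker — as long as synthesis stays ahead on average.
let nextStartTime = 0;

function enqueuePcm(audioCtx, float32Pcm, sampleRate) {
  const buffer = audioCtx.createBuffer(1, float32Pcm.length, sampleRate);
  buffer.copyToChannel(float32Pcm, 0);

  const src = audioCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(audioCtx.destination);

  // Never schedule in the past; keep a small safety margin.
  nextStartTime = Math.max(nextStartTime, audioCtx.currentTime + 0.05);
  src.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```

This only helps if synthesis stays ahead of playback on average, which single-threaded iOS inference frequently can't manage; it softens the stutter rather than eliminating it.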

The same three failure modes, in one table

For comparison's sake, here's how each failure mode compounds across iOS Safari, Android Chrome, and desktop Chrome. The desktop column is what the code is implicitly designed for and where Piper actually works.

Piper-WASM failure modes by platform. Approximate, based on Quick TTS' brief mobile-Piper rollout.

| Failure mode | iOS Safari | Android Chrome | Desktop Chrome |
| --- | --- | --- | --- |
| AudioContext gating | Strict; needs gesture-initiated unlock every session | Strict; same gesture rule, less aggressive in practice | Effectively absent; unlocks on first interaction |
| Tab memory cliff | ~256 MB, undocumented; drops on low-RAM devices; silent kill | ~512 MB practical ceiling; OOM error visible to the page | ~4 GB+ on most desktops; not a constraint for an 80 MB model |
| WASM threads | Flaky; SharedArrayBuffer present but threads often serialize | Functional with COOP/COEP headers | Functional; full thread pool available |
| Observed tab-kill rate during the rollout | High twenties percent | Around 8% | Statistical noise (<1%) |

The Android numbers are bad enough that we kept Piper gated off there too, but the platform issues are addressable in a way iOS' aren't — Android Chrome at least throws errors the page can catch, and the memory ceiling is roughly 2x iOS' on the same nominal RAM budget. If we re-enable mobile Piper, Android will be first.

What Quick TTS does about it (today)

The current behavior is straightforward: aiTtsPiperSupported() in aiTts.js only checks for WebAssembly support, but the upstream toggle in app.js hides the entire AI engine selector when PLATFORM.isMobile is true. Mobile users see Web Speech as the only engine, with their OS voices in the dropdown. The decision is made synchronously during initializeAiToggle() so there's no flash of an unsupported option.
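In outline, with the selector element name invented for the sketch and the real functions simplified:

```js
// aiTts.js — the capability check is intentionally minimal.
function aiTtsPiperSupported() {
  return typeof WebAssembly === 'object';
}

// app.js — the real gate is the platform check, applied synchronously during
// init so the AI option never flashes into view. 'aiEngineSelector' is a
// stand-in name for the real element.
function initializeAiToggle() {
  if (PLATFORM.isMobile || !aiTtsPiperSupported()) {
    aiEngineSelector.hidden = true; // Web Speech stays the only visible engine
    return;
  }
  aiEngineSelector.hidden = false;
}
```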

On Android specifically, the voice <select> is replaced with a "Change ↗" link that deep-links into system accessibility settings (com.google.android.tts.MainActivity) — Android exposes voices at the OS level rather than per-tab, so a per-page picker would lie to the user about which voice is actually active. iOS keeps its voice picker because Apple's voices enumerate cleanly through Web Speech.
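The link itself would ride Android Chrome's intent: URL scheme. We haven't reproduced the exact production URI here; the component syntax below is a plausible shape, not a verified value, and PLATFORM.isAndroid is likewise an assumed flag:

```js
// Swap the per-page voice picker for a link into system TTS settings.
// The intent: URI is illustrative — treat the component as an assumption.
if (PLATFORM.isAndroid) {
  const link = document.createElement('a');
  link.textContent = 'Change ↗';
  link.href = 'intent:#Intent;component=com.google.android.tts/.MainActivity;end';
  voiceSelect.replaceWith(link); // voiceSelect: the <select> being replaced
}
```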

The handoff machinery elsewhere in the codebase (covered in our Web Speech vs Piper vs Kokoro post) means that even on desktop, when Kokoro init or Piper init fails, the remaining text is handed transparently back to Web Speech mid-read. Mobile gets the same engine as the desktop fallback, just without the option to opt up.
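Conceptually, the handoff is a try/catch around the engine with the chunk cursor held outside it. A sketch under those assumptions (aiEngine.speak and speakWithWebSpeech are illustrative names, not the production API):

```js
// Mid-read fallback: if the AI engine dies partway through, the remaining
// chunks go to Web Speech without losing the user's place.
async function readChunks(chunks, aiEngine) {
  let i = 0;
  try {
    for (; i < chunks.length; i++) {
      await aiEngine.speak(chunks[i]); // Piper/Kokoro path
    }
  } catch (err) {
    console.warn('AI engine failed mid-read, handing off:', err);
    for (; i < chunks.length; i++) {
      await speakWithWebSpeech(chunks[i]); // same cursor, new engine
    }
  }
}

function speakWithWebSpeech(text) {
  return new Promise((resolve, reject) => {
    const u = new SpeechSynthesisUtterance(text);
    u.onend = resolve;
    u.onerror = reject;
    speechSynthesis.speak(u);
  });
}
```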

What needs to change for this to work

Three platform-level shifts would make Piper-on-iOS shippable. None of them are in our hands; all of them are tracked by people who do touch the relevant code.

  1. Stable WASM threads. WebKit needs to ship WASM threading that works reliably on COOP/COEP-isolated pages and that scales to the iPhone's actual core count. The infrastructure is there; the implementation has been flaky across several iOS releases.
  2. A predictable per-tab memory budget. Either a published number ("Safari guarantees 256 MB of WebAssembly heap on iPhone 12 and later") or a stable lifecycle event when the OS is about to reclaim a tab. Today neither exists. Without one, a page can either allocate conservatively and underperform, or allocate aggressively and get killed unpredictably.
  3. A tab-eviction event the page can listen for. Even if the memory budget stays tight, an event fired before the kill — analogous to visibilitychange or pagehide but specifically signaling memory pressure — would let a TTS app save its current chunk index, free the model, and resume from the same place when the user returns. Today the kill is silent and unsavable; a sketch of what such an event would enable follows this list.
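To make that third item concrete, here's what a TTS app would do with such an event. The 'memorypressure' event name is entirely hypothetical, as are currentChunkIndex and releasePiperModel; the pagehide handler below it is the closest real approximation today, and it does not fire before a silent memory kill:

```js
// Hypothetical: an eviction warning the page could act on. This event
// does not exist today — it's what we wish we could write.
window.addEventListener('memorypressure', () => {
  localStorage.setItem('resume-chunk', String(currentChunkIndex));
  releasePiperModel(); // free the WASM heap before the OS has to
});

// Real, today: pagehide fires on navigation/backgrounding but NOT before
// a silent memory kill, so it only covers part of the problem.
window.addEventListener('pagehide', () => {
  localStorage.setItem('resume-chunk', String(currentChunkIndex));
});
```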

The instrumentation we wish we'd had earlier

If you're investigating mobile WASM crashes on your own product, the metrics that took us longest to wire up and helped most were:

iOS 18, iOS 19, and the actual timeline

Honest assessment of where the platform is right now:

What works on mobile today

For users on iOS or Android right now, Web Speech via the OS voices is the only practical browser TTS path. It works reliably, has near-zero latency, doesn't allocate enough memory to trigger any reclaim heuristics, and doesn't depend on threading at all (the OS handles synthesis in a separate process).
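For completeness, the entire mobile-safe path is a few lines of the standard Web Speech API (voice matching simplified here):

```js
// The whole mobile TTS path: OS voices through the standard Web Speech API.
// No model download, no WASM heap, no threads.
function speak(text, voiceName) {
  const u = new SpeechSynthesisUtterance(text);
  const voice = speechSynthesis.getVoices().find((v) => v.name === voiceName);
  if (voice) u.voice = voice;
  speechSynthesis.speak(u);
}

// Voices load asynchronously on some platforms; repopulate the picker here.
speechSynthesis.addEventListener('voiceschanged', () => {
  console.log(speechSynthesis.getVoices().map((v) => v.name));
});
```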

The voice quality varies by OS — recent Apple voices like Ava and Tom are very good, Microsoft's Edge Online voices are excellent on Windows, Android's Google TTS is utility-grade — but Web Speech is the ceiling for mobile until one of the three platform issues above is resolved. Quick TTS' Android voice deep-link to system accessibility settings is the next-best UX: rather than pretend to control voice selection from inside a tab, hand the user off to the OS where the picker actually does something.

If you must run Piper on iOS today

The brutal options, in order of how much we'd recommend each:

For the rest of us, the answer for now is: ship Web Speech on mobile, ship Piper and Kokoro on desktop, and revisit when WebKit moves. If you want the longer write-up on how Quick TTS handles all three engines together, our Web Speech vs Piper vs Kokoro post covers the architecture. The FAQ has the user-facing version of the "why doesn't this work on my iPhone" answer, and the guide walks through the use cases that drove these decisions. The product itself is at Quick TTS — open it on a desktop and the AI engine is there waiting.