kitten-tts-webgpu
Version:
Run Kitten TTS (80M) locally in the browser via WebGPU. One function call: textToSpeech('Hello!') → WAV blob.
181 lines (127 loc) • 6.48 kB
Markdown
[](https://www.npmjs.com/package/kitten-tts-webgpu)
[](./LICENSE)
**Pure WebGPU text-to-speech for the browser. 80M params, sub-second on desktop, ~1.2s on iPhone. No ONNX Runtime, no WASM inference — just 29 compute shaders. 753KB gzipped JS + model weights downloaded at runtime.**
[**Live Demo**](https://svenflow.github.io/kitten-tts-webgpu/) | [npm](https://www.npmjs.com/package/kitten-tts-webgpu) | [Model Card](https://huggingface.co/KittenML/kitten-tts-mini-0.8)
---
```bash
npm install kitten-tts-webgpu
```
```typescript
import { textToSpeech } from 'kitten-tts-webgpu';
const blob = await textToSpeech("The quick brown fox jumps over the lazy dog.");
const audio = new Audio(URL.createObjectURL(blob));
audio.play();
```
One function. Text in, WAV blob out (16-bit PCM, 24 kHz mono). The model downloads on first call and is cached for subsequent calls. Full TypeScript types included.
> **Note:** This library requires WebGPU. For server-side rendering frameworks (Next.js, Nuxt), dynamically import on the client side only.
## Size & Performance
### What gets downloaded
| | Size | When |
|-|------|------|
| **JS bundle** | **753 KB** gzipped (2.9 MB raw) | `npm install` / bundled into your app |
| **Model weights** | 24–78 MB (see below) | First `textToSpeech()` call, cached by browser |
The JS bundle includes the WebGPU engine, 29 compute shaders, and a 234K-word phonemizer dictionary. No WASM binaries, no ONNX Runtime.
### Models
Three [Kitten TTS v0.8](https://huggingface.co/KittenML) sizes, same API:
| Model | Params | Weights | M4 Pro (Chrome) | iPhone 17 Pro Max (Safari) |
|-------|--------|---------|------------------|----------------------------|
| **Mini** | 80M | 78 MB | 1.80s (3.3× RT) | ~1.2s |
| **Micro** | 40M | 41 MB | 1.05s (6.2× RT) | — |
| **Nano** | 15M | 24 MB | 0.93s (7.3× RT) | — |
*RT = real-time factor (audio duration ÷ generation time). Higher is better. Times are for warm generation (model already in GPU). First call adds ~2-4s for model download depending on connection.*
```typescript
await textToSpeech("Hello world"); // Default: nano (fastest, 24 MB)
await textToSpeech("Hello world", { model: 'micro' }); // Balanced (41 MB)
await textToSpeech("Hello world", { model: 'mini' }); // Best quality (78 MB)
```
```typescript
const blob = await textToSpeech("Welcome to the future.", {
voice: "Leo", // 8 voices: Bella, Luna, Rosie, Kiki, Jasper, Bruno, Hugo, Leo
speed: 1.2, // 0.5x – 2.0x
model: "micro", // mini | micro | nano
onProgress: (stage) => console.log(stage), // string: "Initializing WebGPU…", "Downloading…", "Generating speech…", etc.
});
```
| Female | Male |
|--------|------|
| Bella | Jasper |
| Luna | Bruno |
| Rosie | Hugo |
| Kiki | Leo |
```typescript
// Check for WebGPU support
if (!navigator.gpu) {
console.log("WebGPU not available — use Chrome 113+, Edge 113+, or Safari 26+");
}
// textToSpeech throws on:
// - No WebGPU support
// - Network error (model download fails)
// - Empty text input
try {
const blob = await textToSpeech("Hello");
} catch (err) {
console.error("TTS failed:", err.message);
}
```
For repeated generations or fine-grained control:
```typescript
import { KittenTTSEngine, textToInputIds, float32ToWav } from 'kitten-tts-webgpu';
const engine = new KittenTTSEngine();
await engine.init();
await engine.loadModel(onnxUrl, voicesUrl);
const { ids } = await textToInputIds("Hello world");
const { waveform } = await engine.generate(ids, "Bella", 1.0);
// waveform: Float32Array of 24kHz PCM samples
const wavBlob = float32ToWav(waveform, 24000);
```
29 hand-written [WGSL compute shaders](./src/shaders.ts) execute the full TTS pipeline on GPU:
```
Text → Phonemes (234K-word dictionary + espeak rules in pure JS)
→ ALBERT encoder (embedding, multi-head attention, FFN)
→ Duration predictor (LSTM + CNN)
→ Acoustic decoder (LSTM + AdaIN + CNN, style-conditioned)
→ HiFi-GAN vocoder (ConvTranspose1d, Snake activations, iSTFT)
→ 24kHz WAV
```
**Why not ONNX Runtime Web?**
Most browser TTS uses ONNX Runtime Web (~2MB WASM binary + C++ runtime). This project takes a different approach:
- **Custom ONNX parser** — dequantizes int8/uint8/float16 weights in pure TypeScript, no C++ runtime
- **234K-word phonemizer** — espeak-ng rules ported to pure JS (WASM espeak hangs on iOS Safari)
- **GPU buffer pooling** — reuses buffers across HiFi-GAN iterations, ~130MB peak on mobile
- **Dynamic architecture** — detects model dimensions from weight shapes, one engine for all 3 sizes
## Browser Support
| Browser | Status |
|---------|--------|
| Chrome 113+ | ✅ |
| Edge 113+ | ✅ |
| Safari 26+ (macOS/iOS) | ✅ |
| Firefox Nightly | Experimental |
## FAQ
**Max input length?** Recommended under ~500 characters per call. For longer text, split into sentences.
**Languages?** English only (matches the upstream Kitten TTS model).
**Offline?** Yes, after the model is cached in the browser. No server needed for inference.
**Self-hosting models?** Pass custom URLs to `KittenTTSEngine.loadModel(onnxUrl, voicesUrl)`.
**Bundle size?** 753KB gzipped (2.9MB raw). Includes engine, 29 compute shaders, and 234K-word phonemizer dictionary. Model weights (24–78MB depending on model size) are downloaded separately at runtime on first call and cached by the browser.
**Model license?** Kitten TTS models are released under [Apache 2.0](https://huggingface.co/KittenML/kitten-tts-mini-0.8). Code in this repo is MIT.
## Development
```bash
git clone https://github.com/svenflow/kitten-tts-webgpu.git
cd kitten-tts-webgpu
npm install
npm run dev # Dev server
npm run build # Production build
npm test # Phonemizer tests
```
## Credits
- [Kitten TTS](https://huggingface.co/KittenML/kitten-tts-mini-0.8) models by KittenML (Apache 2.0)
- [espeak-ng](https://github.com/espeak-ng/espeak-ng) pronunciation dictionary and letter-to-sound rules (GPL-3.0, bundled as data files)
- [phonemizer](https://www.npmjs.com/package/phonemizer) by Xenova (espeak-ng WASM, used as primary backend on Chrome/Firefox; pure JS fallback on Safari)
## License
MIT