phonemize
Version:
Fast phonemizer with rule-based G2P prediction. Pure JavaScript implementation.
239 lines (171 loc) • 7.11 kB
Markdown
[](https://github.com/hans00/phonemize/actions/workflows/ci.yml)
[](https://codecov.io/gh/hans00/phonemize)
[](https://badge.fury.io/js/phonemize)
[](https://opensource.org/licenses/MIT)
[](https://nodejs.org/)
Fast phonemizer with rule-based G2P (Grapheme-to-Phoneme) prediction.
Pure JavaScript implementation with no native dependencies.
Inspired by [ttstokenizer](https://github.com/neuml/ttstokenizer)
## Features
- ⚡ **Lightning fast** - Pure rule-based processing, no ML overhead
- 🎯 **Intelligent compound word support** - Automatic decomposition of complex words
- 📚 **Comprehensive dictionary** - 125,000+ word pronunciations
- 🧠 **Smart rule-based G2P** - Advanced phonetic rules for unknown words
- 🌍 **Multiple formats** - IPA and ARPABET output
- 🌐 **Multilingual support** - Chinese, Japanese, Korean and more via anyAscii
- 💻 **Pure JavaScript** - No native dependencies, works everywhere
- 🔧 **Simple API** - Easy to integrate and use
## Installation
```bash
npm install phonemize
```
## Quick Start
```javascript
import { phonemize, toIPA, toARPABET } from 'phonemize'
// Default IPA output
console.log(phonemize('Hello world!'))
// Output: həˈɫoʊ ˈwɝɫd!
// ARPABET format
console.log(toARPABET('Hello world!'))
// Output: HH AX EL1 OW W1 ER EL D!
```
Automatically detects and decomposes compound words:
```javascript
phonemize('supercar') // → ˈsupɝˈkɑɹ (super + car)
phonemize('playground') // → ˈpɫeɪˌɡɹaʊn (play + ground)
phonemize('superman') // → ˈsupɝˌmæn (super + man)
```
### Multi-Compound Words
Handles extremely long compound words intelligently:
```javascript
phonemize('supercalifragilisticexpialidocious')
phonemize('antidisestablishmentarianism')
phonemize('pneumonoultramicroscopicsilicovolcanoconiosss')
```
## Multilingual Support
Supports multiple languages through anyAscii transliteration:
```javascript
// Chinese (direct processing with tone numbers)
phonemize('你好世界') // → ni˧˥ xɑʊ˨˩˦ ʂɻ̩˥˩ tɕiɛ˥˩
phonemize('北京') // → peɪ˧˩˧ tɕiŋ˥˥
// Japanese (with anyAscii and rule-based processing)
phonemize('こんにちは', { anyAscii: true }) // → konnitɕiwa
phonemize('東京', { anyAscii: true }) // → tʊŋ˥˥ tɕiŋ˥˥
// Korean (with anyAscii and rule-based processing)
phonemize('안녕하세요', { anyAscii: true }) // → ʔannjʌŋhaseyo
phonemize('서울', { anyAscii: true }) // → ˈsoʊɫ
// Other languages fallback to English G2P after anyAscii
phonemize('Привет', { anyAscii: true }) // → ˈpɹaɪvɛt
```
Note: anyascii only ensures an approximation and is likely not the correct pronunciation
## API Reference
### Basic Functions
#### `phonemize(text, options?)`
Convert text to phonemes.
```javascript
phonemize('Hello world!') // IPA string
phonemize('Hello world!', { returnArray: true }) // IPA array
```
**Options:**
- `returnArray` (boolean): Return array instead of string
- `format` ('ipa' | 'arpabet'): Output format
- `stripStress` (boolean): Remove stress markers
- `separator` (string): Phoneme separator (default: ' ')
- `anyAscii` (boolean): Enable multilingual support via anyAscii transliteration
#### `toIPA(text, options?)`
Convert text to IPA phonemes.
```javascript
toIPA('Hello world!') // "həˈɫoʊ ˈwɝɫd!"
```
#### `toARPABET(text, options?)`
Convert text to ARPABET phonemes.
```javascript
toARPABET('Hello world!') // "HH AX L OW1 W ER1 L D!"
```
#### `toZhuyin(text, options?)`
Convert text to Zhuyin (Bopomofo / 注音) format.
This function is specifically designed for Chinese text. Non-Chinese text will be phonemized to IPA as a fallback.
**Note:** The output format is `Zhuyin + tone number` (e.g., `ㄓㄨㄥ1 ㄨㄣ2`), which is optimized for **Kokoro**.
```javascript
import { toZhuyin } from 'phonemize';
toZhuyin('中文'); // "ㄓㄨㄥ1 ㄨㄣ2"
toZhuyin('你好世界'); // "ㄋㄧ3 ㄏㄠ3 ㄕ4 ㄐㄧㄝ4"
toZhuyin('中文 and English'); // "ㄓㄨㄥ1 ㄨㄣ2 ænd ˈɪŋɡlɪʃ"
```
```javascript
import { addPronunciation } from 'phonemize'
// Add custom word pronunciation
addPronunciation('myword', 'ˈmaɪwərd') // Can be IPA or ARPABET
console.log(phonemize('myword')) // "ˈmaɪwərd"
```
```javascript
import { Tokenizer, createTokenizer } from 'phonemize'
// Create custom tokenizer
const tokenizer = createTokenizer({
format: 'ipa',
stripStress: true,
separator: '-'
})
// Tokenize with detailed info
const tokens = tokenizer.tokenizeToTokens('Hello world!')
// [
// { phoneme: "həɫoʊ", word: "Hello", position: 0 },
// { phoneme: "wɝɫd", word: "world", position: 6 }
// ]
```
Numbers are automatically converted to words:
```javascript
phonemize('I have 123 apples')
// "ˈaɪ ˈhæv ˈwən ˈhəndɝd ˈtwɛni ˈθɹi ˈæpəɫz"
```
### Abbreviation Expansion
Common abbreviations are expanded:
```javascript
phonemize('Dr. Smith and Mr. Johnson')
// "ˈdɑktɝ ˈsmɪθ ˈænd ˈmɪstɝ ˈdʒɑnsən"
```
### Currency and Dates
Special handling for currency and dates:
```javascript
phonemize('15 dollars in 2023')
// "ˈfɪfˈtin ˈdɑɫɝz ˈɪn ˈtwɛni ˈtwɛni ˈθɹi"
```
## Performance
- **Dictionary lookup**: O(1) - Instant for known words
- **Rule-based processing**: Extremely fast, no model loading
- **Compound decomposition**: Efficient balanced search algorithm
- **Memory efficient**: Compressed JSON dictionaries only
- **Zero startup time**: No model initialization required
Typical performance: **>10000 words/second** on modern hardware.
## Processing Pipeline
1. **Language Detection** - Detect language before anyAscii conversion (if enabled)
2. **anyAscii Transliteration** - Convert non-Latin scripts to ASCII (if enabled)
3. **Dictionary Lookup** - Check for exact word match
4. **Multilingual Processing** - Handle Chinese, Japanese, Korean, etc.
5. **Compound Detection** - Intelligent decomposition of compound words
6. **Multi-Compound Handling** - Special processing for very long compounds
7. **Rule-Based G2P** - Apply phonetic rules for unknown words
Note: The rule based G2P is LLM generated, may error generate. Best practice is use custom pronunciation for unknown words.
## Supported Phoneme Sets
### IPA (International Phonetic Alphabet)
Standard IPA symbols for English phonemes with stress marks.
### ARPABET
CMU ARPABET phoneme set with stress numbers (0,1,2).
## Building from Source
```bash
# Install dependencies
yarn
# Compile TypeScript and dictionaries
yarn build
# Run tests
yarn test
```
## License
MIT