Japanese morphological analyzer for Node.js — Viterbi tokenizer with mmap dict loading and pluggable POS-source strategy
# Kusamoji 草文字
Segments Japanese text into morphemes and attaches part of speech, reading, and pronunciation metadata.
## Features
- **Viterbi tokenization** with IPADIC/NEologd dictionary support
- **Custom dictionary** — bring your own IPADIC/NEologd `.dat` files
- **OS-level native dict loading** — loads the dictionary via memory-mapped I/O for near-instant boot (~1 s vs ~3-4 s with the `fs.readFile` fallback) and an OS-managed page cache
- **Automatic memory management** — lets the OS handle page caching; no manual tuning needed
- **Viterbi length bonus** — prevents short dictionary fragments from stealing prefixes of longer correct matches
- **Zero-copy TypedArray access** to binary dictionary data
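To illustrate the length-bonus idea, here is a toy minimum-cost segmenter. It is a simplified sketch, not kusamoji's implementation: it ignores connection costs between morphemes, and the `segment` function, the toy word costs, and the per-character bonus formula are all assumptions made for illustration.

```js
// Toy dictionary: word -> cost (lower is better). Values are invented.
const toyDict = new Map([
  ['東', 3],
  ['京都', 3],
  ['東京都', 10],
])

// Minimum-cost segmentation by dynamic programming over end positions.
// Assumed bonus: each word's cost is reduced by lengthBonus * (length - 1).
function segment(text, dict, lengthBonus = 0) {
  const n = text.length
  const best = new Array(n + 1).fill(Infinity) // best[i]: cheapest cost to reach position i
  const back = new Array(n + 1).fill(null)     // back[i]: start of the last word on that path
  best[0] = 0
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === Infinity) continue
      const word = text.slice(start, end)
      if (!dict.has(word)) continue
      const cost = best[start] + dict.get(word) - lengthBonus * (word.length - 1)
      if (cost < best[end]) {
        best[end] = cost
        back[end] = start
      }
    }
  }
  if (best[n] === Infinity) return null // no path covers the whole text
  const words = []
  for (let end = n; end > 0; end = back[end]) {
    words.unshift(text.slice(back[end], end))
  }
  return words
}

console.log(segment('東京都', toyDict, 0)) // [ '東', '京都' ]  fragments win on raw cost
console.log(segment('東京都', toyDict, 5)) // [ '東京都' ]      the bonus favors the long match
```

With these toy costs, the two short entries sum to 6 and beat the single long entry at 10. A bonus of 5 per extra character lowers the long entry to 0 and the fragment path to 1, so the longer match wins.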
## Install
```bash
pnpm add kusamoji
# or
npm install kusamoji
```
### How the native mmap addon works
kusamoji ships pre-compiled mmap binaries for common platforms. The addon is **optional**: kusamoji works without it, but boots more slowly and uses more RAM.
**You don't need to do anything.** On first use, kusamoji automatically:
1. Finds the matching prebuilt binary inside the package (`src/native/prebuilds/`)
2. Copies it to `~/.kusamoji/` for persistence across reinstalls
3. Loads it — mmap dict loading is now active
If no prebuilt matches your platform, kusamoji silently falls back to `fs.readFile`. Everything works — the mmap addon is a performance optimization, not a requirement.
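The lookup-and-fallback flow above can be pictured roughly as follows. This is a sketch under assumptions, not kusamoji's actual loader: the `<platform>-<arch>.node` file naming, the `resolveAddon` helper, and its arguments are hypothetical.

```js
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: returns the loaded addon, or null when none is usable.
function resolveAddon(prebuildsDir, cacheDir = path.join(os.homedir(), '.kusamoji')) {
  // Assumed naming scheme, e.g. "darwin-arm64.node" or "linux-x64.node".
  const name = `${process.platform}-${process.arch}.node`
  const shipped = path.join(prebuildsDir, name)
  const cached = path.join(cacheDir, name)
  try {
    if (!fs.existsSync(cached)) {
      if (!fs.existsSync(shipped)) return null // step 1 failed: no matching prebuilt
      fs.mkdirSync(cacheDir, { recursive: true })
      fs.copyFileSync(shipped, cached)         // step 2: persist outside node_modules
    }
    return require(cached)                     // step 3: load the native addon
  } catch {
    return null // any failure means the caller falls back to fs.readFile
  }
}
```

Returning `null` instead of throwing keeps the fallback decision in one place: the caller only has to check whether an addon came back.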
### Shipped prebuilts
| Platform | Architecture | Status |
| -------- | --------------------- | -------------------------- |
| macOS | Apple Silicon (arm64) | ✅ Shipped |
| macOS | Intel (x64) | Compile from source |
| Linux | x64 (Intel/AMD) | ✅ Shipped |
| Linux | arm64 (Graviton, RPi) | ✅ Shipped |
| Windows | any | Not supported (POSIX only) |
### Troubleshooting the native addon
**"I installed kusamoji but I'm not sure if mmap is active"**
```bash
node -e "
const path = require('path');
const loader = require(path.join(require.resolve('kusamoji'), '..', 'native', 'loader.js'));
const addon = loader.loadMmapAddon();
console.log(addon ? 'mmap is ACTIVE' : 'mmap is NOT active (using fs.readFile fallback)');
"
```
**"pnpm install didn't set up the addon"**
This is normal. pnpm may skip `postinstall` scripts for security. The addon is loaded lazily on first use — no manual setup needed. If you want to pre-warm the cache:
```bash
pnpm rebuild kusamoji
```
**"I want to compile the addon from source"**
For platforms without a shipped prebuilt, or if you want to rebuild:
```bash
npx kusamoji rebuild-native
```
Requires a C compiler (gcc or clang) and Python 3. The compiled binary is cached at `~/.kusamoji/` and persists across `pnpm install` cycles.
**"I'm on an unsupported platform"**
kusamoji falls back to `fs.readFile` automatically. Dictionary loading still works — boot is ~3-4s instead of ~1s, and RAM is higher (~2.5 GB vs ~1.4 GB for NEologd). No action needed.
### Binary cache directory (`~/.kusamoji/`)
The native addon binary is cached at `~/.kusamoji/` along with a `config.json` metadata file. This cache:
- Survives `pnpm install` / `npm install` cycles
- Is validated against your Node.js N-API version on each load
- Is automatically refreshed when you upgrade Node.js to a new major version
- Can be safely deleted — it will be recreated on next use
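As an illustration of the validation described above, a check of this kind could compare versions recorded in `config.json` with the running Node.js. The field names `napiVersion` and `nodeMajor` are hypothetical; the actual layout of `config.json` is not documented here.

```js
// Hypothetical config.json shape: { "napiVersion": "9", "nodeMajor": 22 }
function isCacheValid(config, versions = process.versions) {
  if (!config) return false
  const nodeMajor = Number(versions.node.split('.')[0])
  return (
    String(config.napiVersion) === String(versions.napi) && // same N-API version
    Number(config.nodeMajor) === nodeMajor                  // same Node.js major
  )
}

// Example with explicit version data instead of the live process:
console.log(isCacheValid({ napiVersion: '9', nodeMajor: 22 }, { node: '22.3.0', napi: '9' })) // true
console.log(isCacheValid({ napiVersion: '9', nodeMajor: 22 }, { node: '23.0.0', napi: '9' })) // false
```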
## Quick Start
```js
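const kusamoji = require('kusamoji')

// CommonJS has no top-level await, so the calls are wrapped in an async function.
async function main() {
  const tokenizer = await kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync()
  const tokens = tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った')
  for (const token of tokens) {
    console.log(token.surface_form, token.reading, token.pos)
  }
}

main().catch(console.error)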
// 大谷翔平 オオタニショウヘイ 名詞
// が ガ 助詞
// ロサンゼルス ロサンゼルス 名詞
// ・ ・ 記号
// ドジャース ドジャース 名詞
// で デ 助詞
// 3 サン 名詞
// 本塁打 ホンルイダ 名詞
// を ヲ 助詞
// 放っ ハナッ 動詞
// た タ 助動詞
```
### More examples
Dates, counters, and proper nouns are resolved natively from the dictionary — no preprocessing needed:
```js
tokenizer.tokenize('2026年4月9日、川崎市の製鉄所で作業員が転落する事故が発生した')
// 2026年 ニセンニジュウロクネン 名詞 ← full year reading
// 4月9日 シガツココノカ 名詞 ← month + day as one token
// 、 、 記号
// 川崎市 カワサキシ 名詞 ← place name
// の ノ 助詞
// 製鉄所 セイテツジョ 名詞 ← rendaku: 所(ショ→ジョ)
// で デ 助詞
// 作業員 サギョウイン 名詞
// が ガ 助詞
// 転落 テンラク 名詞
// する スル 動詞
// 事故 ジコ 名詞
// が ガ 助詞
// 発生 ハッセイ 名詞
// し シ 動詞
// た タ 助動詞
tokenizer.tokenize('藤井聡太名人は第84期将棋名人戦で圧倒的な強さを見せた')
// 藤井聡太 フジイソウタ 名詞 ← NEologd proper noun
// 名人 メイジン 名詞
// は ハ 助詞
// 第 ダイ 接頭詞
// 84期 ハチジュウヨンキ 名詞 ← digit+counter compound
// 将棋 ショウギ 名詞
// 名人戦 メイジンセン 名詞
// で デ 助詞
// 圧倒的 アットウテキ 名詞
// な ナ 助動詞
// 強 ツヨ 形容詞
// さ サ 名詞
// を ヲ 助詞
// 見せ ミセ 動詞
// た タ 助動詞
```
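Because every token carries a `reading`, a whole-sentence katakana reading can be assembled by concatenation. The sketch below runs on hand-written token objects copied from the Quick Start output, so it needs no dictionary. `toReading` is a hypothetical helper, and the fallback to `surface_form` for tokens with a missing or `*` reading is an assumption about how unknown words appear.

```js
// Join per-token readings into one katakana string.
// Assumption: tokens without a usable reading keep their surface form.
function toReading(tokens) {
  return tokens
    .map((t) => (t.reading && t.reading !== '*' ? t.reading : t.surface_form))
    .join('')
}

const sample = [
  { surface_form: '大谷翔平', reading: 'オオタニショウヘイ', pos: '名詞' },
  { surface_form: 'が', reading: 'ガ', pos: '助詞' },
  { surface_form: 'Dodgers', pos: '名詞' }, // hypothetical token with no reading
]

console.log(toReading(sample)) // オオタニショウヘイガDodgers
```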
## Benchmarks
All numbers measured on Apple M1 Pro, Node.js 22, NEologd dictionary (6.1M entries, 1.4 GB uncompressed). Methodology: 700 real-world Japanese news snippets × 9 conversion variants = 6,300 HTTP calls end-to-end through an Express service.
### Cold start
| Mode | Boot time | Ready for first query |
| ------------ | --------: | --------------------------------------------------- |
| **kusamoji** | **1.0 s** | Dictionary memory-mapped, OS demand-pages on access |
| kuromoji.js | 8–12 s | gunzip + parse all 12 .dat.gz files |
### Runtime memory (RSS)
| Mode | Idle RSS | Under load (700 concurrent) | Peak |
| ------------ | ---------: | --------------------------: | -------: |
| **kusamoji** | **1.4 GB** | 2.2 GB | 3.1 GB |
| kuromoji.js | 6–8 GB | 8+ GB | OOM risk |
With mmap, the ~1.4 GB dictionary sits in the OS page cache, **not the V8 heap**. Under memory pressure the OS evicts cold pages automatically, and V8's garbage collector never sees the dictionary data.
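The zero-copy side of this can be shown with plain Node.js buffers. The sketch uses an ordinary `ArrayBuffer` as a stand-in for the memory-mapped region, since the addon's real return type is not documented here, and `asInt32View` is a hypothetical helper.

```js
// Wrap a Buffer's bytes in an Int32Array without copying them.
function asInt32View(buf) {
  if (buf.byteOffset % 4 !== 0 || buf.byteLength % 4 !== 0) {
    throw new RangeError('buffer is not 4-byte aligned')
  }
  return new Int32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4)
}

const backing = new ArrayBuffer(16) // stand-in for an mmap'd region
const buf = Buffer.from(backing)    // shares memory with `backing`, no copy
const view = asInt32View(buf)

buf.fill(1)                         // write through the Buffer...
console.log(view[0] === 0x01010101) // true: the typed array sees the same bytes
```

The view is only a window onto existing memory, so creating it costs the same regardless of how large the underlying region is.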
### Tokenization throughput
| Input | Tokens/call | Latency (p50) | Throughput |
| -------------------------- | ----------: | ------------: | ------------: |
| Short sentence (10 chars) | ~5 | **0.3 ms** | 3,300 calls/s |
| News headline (50 chars) | ~20 | **0.8 ms** | 1,250 calls/s |
| News article (500 chars) | ~150 | **5 ms** | 200 calls/s |
| Long article (2,000 chars) | ~600 | **18 ms** | 55 calls/s |
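The throughput column is consistent with single-threaded throughput ≈ 1000 / (p50 latency in ms); for example, 1000 / 0.8 = 1,250 calls/s. A minimal way to take a similar measurement on your own input is sketched below. `measureP50` is a hypothetical helper, not part of kusamoji, and it is not the harness behind the numbers above.

```js
const { performance } = require('perf_hooks')

// Run fn repeatedly and report the median latency and the implied call rate.
function measureP50(fn, runs = 1000) {
  const samples = []
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now()
    fn()
    samples.push(performance.now() - t0) // elapsed milliseconds for this call
  }
  samples.sort((a, b) => a - b)
  const p50Ms = samples[Math.floor(samples.length / 2)]
  return { p50Ms, callsPerSec: 1000 / p50Ms }
}

// Usage with a built tokenizer (assumed to exist in scope):
//   measureP50(() => tokenizer.tokenize('東京都に住んでいます'))
```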
### Accuracy (6,300-call harness)
700 real-world news snippets from Yahoo News Japan, NHK, and Mainichi — mixed content with ASCII brand names, URLs, numbers, brackets, and quoted English.
The snippets used as input are available in the [Kusamoji Test News Snippets](https://github.com/KimuraRisei/kusamoji-test-news-snippets) repository.
| Metric | Score |
| ----------------------------------- | ---------------------------------------- |
| Romaji conversion (5 systems × 700) | **99.0%** kanji-free output |
| Kana conversion (4 modes × 700) | **99.0%** kanji-free output |
| Jukujikun (熟字訓) accuracy | **48 / 49** tested compounds |
| Proper noun accuracy (NEologd) | **10 / 10** (大谷翔平, 宮崎駿, etc.) |
| Place name accuracy | **10 / 10** (東京, 鹿児島, 秋葉原, etc.) |
| File descriptor leaks | **0** after 6,300 calls |
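One plausible reading of "kanji-free output" is that the converted string contains no Han characters. A check along those lines is sketched below; this interpretation and the `isKanjiFree` / `kanjiFreeRate` helpers are assumptions, not the harness's actual code.

```js
// Matches any character in the Han script (kanji).
const HAN = /\p{Script=Han}/u

function isKanjiFree(s) {
  return !HAN.test(s)
}

// Fraction of outputs that contain no kanji at all.
function kanjiFreeRate(outputs) {
  if (outputs.length === 0) return 0
  return outputs.filter(isKanjiFree).length / outputs.length
}

console.log(isKanjiFree('トウキョウ'))   // true
console.log(isKanjiFree('東京タワー'))   // false
console.log(kanjiFreeRate(['トウキョウ', '東京'])) // 0.5
```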
### vs. alternatives
| Feature | kusamoji | kuromoji.js | MeCab (C++) | Sudachi (Java/Rust) |
| -------------------- | ----------------------- | -------------- | --------------- | ------------------- |
| Runtime | Node.js | Node.js | Native binary | JVM / Native |
| Dict loading | **mmap (zero-copy)** | gunzip to heap | mmap | mmap (Rust) |
| Boot time (NEologd) | **~1 s** | ~10 s | ~0 s | ~0.2 s |
| RSS (NEologd) | **~1.4 GB** | ~6-8 GB | ~0.5 GB | ~0.2 GB |
| Viterbi optimization | **Length bonus** | None | Cost estimation | CowArray |
| POS source strategy | **Pluggable (3 modes)** | In-heap only | mmap | mmap |
| NEologd support | ✅ | ✅ | ✅ | ✅ (built-in) |
| Node.js native | ✅ | ✅ | FFI required | FFI required |
| npm install | ✅ `npm i kusamoji` | ✅ | ❌ | ❌ |
| Zero native deps | ✅ (optional mmap) | ✅ | N/A | N/A |
> **Note:** MeCab and Sudachi achieve lower RSS because they are written in compiled languages with direct memory management. kusamoji's mmap addon brings Node.js RSS to within 4× of MeCab's (~1.4 GB vs ~0.5 GB), the closest any pure-npm Japanese tokenizer has gotten.
## API
### `kusamoji.builder(options)`
Returns a `TokenizerBuilder`.
| Option | Type | Required | Description |
| --------- | -------- | -------- | ---------------------------------------------------- |
| `dicPath` | `string` | Yes | Path to the directory containing the 12 `.dat` files |
### `builder.buildAsync()` → `Promise<Tokenizer>`
Loads the dictionary and returns a `Tokenizer` instance.
### `builder.build(callback)`
Callback-style variant: `callback(err, tokenizer)`.
### `tokenizer.tokenize(text)` → `Token[]`
Tokenizes input text. Returns an array of tokens:
```js
{
  surface_form: "東京",         // as it appears in the text
  pos: "名詞",                  // part of speech
  pos_detail_1: "固有名詞",     // POS subcategory 1
  pos_detail_2: "地域",         // POS subcategory 2
  pos_detail_3: "一般",         // POS subcategory 3
  conjugated_type: "*",         // conjugation type
  conjugated_form: "*",         // conjugated form
  basic_form: "東京",           // dictionary form
  reading: "トウキョウ",        // reading in katakana
  pronunciation: "トーキョー",  // pronunciation in katakana
  word_type: "KNOWN",           // "KNOWN" or "UNKNOWN"
}
```
Returns `[]` for `null`, `undefined`, or empty string input.
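The POS fields make it straightforward to filter tokens after the fact. As a small sketch, the hypothetical helper below picks out proper nouns using the `pos` and `pos_detail_1` values shown in the token example; the second sample token is hand-written for illustration.

```js
// Return the surface forms of all proper nouns (名詞 / 固有名詞).
function properNouns(tokens) {
  return tokens
    .filter((t) => t.pos === '名詞' && t.pos_detail_1 === '固有名詞')
    .map((t) => t.surface_form)
}

const tokens = [
  { surface_form: '東京', pos: '名詞', pos_detail_1: '固有名詞' },
  { surface_form: 'は', pos: '助詞', pos_detail_1: '係助詞' }, // illustrative values
]

console.log(properNouns(tokens)) // [ '東京' ]
console.log(properNouns([]))     // []  (the same shape tokenize() returns for empty input)
```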
## Dictionary Files
kusamoji does NOT bundle a dictionary. You need 12 **uncompressed** `.dat` files compiled from IPADIC (or IPADIC-format compatible) CSV sources:
```
base.dat, check.dat, cc.dat, tid.dat, tid_map.dat, tid_pos.dat,
unk.dat, unk_char.dat, unk_compat.dat, unk_invoke.dat, unk_map.dat, unk_pos.dat
```
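Before passing a directory as `dicPath`, it can be worth confirming that all 12 files are present. The check below is a small sketch; `missingDictFiles` is a hypothetical helper, not part of kusamoji's API, and it only tests for file existence.

```js
const fs = require('fs')
const path = require('path')

// The 12 file names listed above.
const REQUIRED_DAT_FILES = [
  'base.dat', 'check.dat', 'cc.dat', 'tid.dat', 'tid_map.dat', 'tid_pos.dat',
  'unk.dat', 'unk_char.dat', 'unk_compat.dat', 'unk_invoke.dat', 'unk_map.dat', 'unk_pos.dat',
]

// Return the names of required files that do not exist in dicPath.
function missingDictFiles(dicPath) {
  return REQUIRED_DAT_FILES.filter((name) => !fs.existsSync(path.join(dicPath, name)))
}

// Usage:
//   const missing = missingDictFiles('/path/to/dict')
//   if (missing.length > 0) throw new Error(`Missing dictionary files: ${missing.join(', ')}`)
```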
### Building a dictionary
Use the included build script with IPADIC CSV sources:
```bash
node node_modules/kusamoji/dict-source/build.mjs \
--source /path/to/csv-sources \
--output /path/to/output
```
The source directory must contain:
- `ipadic/` — base IPADIC CSV files + `matrix.def`, `char.def`, `unk.def`
- `custom/` — (optional) your own override entries
## License
[BSL 1.1](LICENSE) — free for personal and non-commercial use. Commercial use requires a license. Change date: 4 years from release.