
Japanese morphological analyzer for Node.js — Viterbi tokenizer with mmap dict loading and pluggable POS-source strategy

# Kusamoji 草文字

Segments Japanese text into morphemes and attaches part of speech, reading, and pronunciation metadata.

## Features

- **Viterbi tokenization** with IPADIC/NEologd dictionary support
- **Custom dictionary** — bring your own IPADIC/NEologd `.dat` files
- **OS-level native dict loading** — loads the dictionary via memory-mapped I/O for near-instant boot (~1s vs ~4s with the `fs.readFile` fallback) and an OS-managed page cache
- **Automatic memory management** — lets the OS handle page caching; no manual tuning needed
- **Viterbi length bonus** — prevents short dictionary fragments from stealing prefixes of longer correct matches
- **Zero-copy TypedArray access** to binary dictionary data

## Install

```bash
pnpm install kusamoji
# or
npm add kusamoji
```

### How the native mmap addon works

kusamoji ships pre-compiled mmap binaries for common platforms. The addon is **optional** — kusamoji works without it, just with slower boot and higher RAM usage.

**You don't need to do anything.** On first use, kusamoji automatically:

1. Finds the matching prebuilt binary inside the package (`src/native/prebuilds/`)
2. Copies it to `~/.kusamoji/` for persistence across reinstalls
3. Loads it — mmap dict loading is now active

If no prebuilt matches your platform, kusamoji silently falls back to `fs.readFile`. Everything works — the mmap addon is a performance optimization, not a requirement.
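
The three steps above, plus the fallback, can be sketched roughly as follows. This is an illustration only, not kusamoji's actual loader: the function names `resolveAddon` and `chooseLoader`, the `<platform>-<arch>/mmap.node` directory layout, and the cache file name are all assumptions made for the example.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical layout: <prebuildDir>/<platform>-<arch>/mmap.node
// (kusamoji's real file names and directory structure may differ).
function resolveAddon({ prebuildDir, cacheDir, platform = process.platform, arch = process.arch }) {
  const name = `${platform}-${arch}`
  const source = path.join(prebuildDir, name, 'mmap.node')
  const cached = path.join(cacheDir, `${name}.node`)

  // A binary cached by an earlier run is reused as-is.
  if (fs.existsSync(cached)) return cached

  // Step 1: look for a prebuilt matching this platform; none means "use the fallback".
  if (!fs.existsSync(source)) return null

  // Step 2: copy it into the cache directory so it survives reinstalls.
  fs.mkdirSync(cacheDir, { recursive: true })
  fs.copyFileSync(source, cached)

  // Step 3: the caller would now require() the returned path.
  return cached
}

// Loading strategy: mmap when an addon path was found, fs.readFile otherwise.
function chooseLoader(addonPath) {
  return addonPath ? 'mmap' : 'fs.readFile'
}
```

Returning `null` instead of throwing keeps the missing-addon case on the normal code path, which matches the silent fallback described above.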
### Shipped prebuilts

| Platform | Architecture          | Status                     |
| -------- | --------------------- | -------------------------- |
| macOS    | Apple Silicon (arm64) | ✅ Shipped                 |
| macOS    | Intel (x64)           | Compile from source        |
| Linux    | x64 (Intel/AMD)       | ✅ Shipped                 |
| Linux    | arm64 (Graviton, RPi) | ✅ Shipped                 |
| Windows  | any                   | Not supported (POSIX only) |

### Troubleshooting the native addon

**"I installed kusamoji but I'm not sure if mmap is active"**

```bash
node -e "
const path = require('path');
const loader = require(path.join(require.resolve('kusamoji'), '..', 'native', 'loader.js'));
const addon = loader.loadMmapAddon();
console.log(addon ? 'mmap is ACTIVE' : 'mmap is NOT active (using fs.readFile fallback)');
"
```

**"pnpm install didn't set up the addon"**

This is normal. pnpm may skip `postinstall` scripts for security. The addon is loaded lazily on first use — no manual setup needed. If you want to pre-warm the cache:

```bash
pnpm rebuild kusamoji
```

**"I want to compile the addon from source"**

For platforms without a shipped prebuilt, or if you want to rebuild:

```bash
npx kusamoji rebuild-native
```

Requires: C compiler (gcc/clang), Python 3. The compiled binary is cached at `~/.kusamoji/` and persists across `pnpm install` cycles.

**"I'm on an unsupported platform"**

kusamoji falls back to `fs.readFile` automatically. Dictionary loading still works — boot is ~3-4s instead of ~1s, and RAM is higher (~2.5 GB vs ~1.4 GB for NEologd). No action needed.

### Binary cache directory (`~/.kusamoji/`)

The native addon binary is cached at `~/.kusamoji/` along with a `config.json` metadata file.
This cache:

- Survives `pnpm install` / `npm install` cycles
- Is validated against your Node.js N-API version on each load
- Is automatically refreshed when you upgrade Node.js to a new major version
- Can be safely deleted — it will be recreated on next use

## Quick Start

```js
const kusamoji = require('kusamoji')

// CommonJS has no top-level await, so the async build runs inside a function.
async function main() {
  const tokenizer = await kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync()
  const tokens = tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った')

  for (const token of tokens) {
    console.log(token.surface_form, token.reading, token.pos)
  }
}

main().catch(console.error)
// 大谷翔平 オオタニショウヘイ 名詞
// が ガ 助詞
// ロサンゼルス ロサンゼルス 名詞
// ・ ・ 記号
// ドジャース ドジャース 名詞
// で デ 助詞
// 3 サン 名詞
// 本塁打 ホンルイダ 名詞
// を ヲ 助詞
// 放っ ハナッ 動詞
// た タ 助動詞
```

### More examples

Dates, counters, and proper nouns are resolved natively from the dictionary — no preprocessing needed:

```js
tokenizer.tokenize('2026年4月9日、川崎市の製鉄所で作業員が転落する事故が発生した')
// 2026年 ニセンニジュウロクネン 名詞 ← full year reading
// 4月9日 シガツココノカ 名詞 ← month + day as one token
// 、 、 記号
// 川崎市 カワサキシ 名詞 ← place name
// の ノ 助詞
// 製鉄所 セイテツジョ 名詞 ← rendaku: 所(ショ→ジョ)
// で デ 助詞
// 作業員 サギョウイン 名詞
// が ガ 助詞
// 転落 テンラク 名詞
// する スル 動詞
// 事故 ジコ 名詞
// が ガ 助詞
// 発生 ハッセイ 名詞
// し シ 動詞
// た タ 助動詞

tokenizer.tokenize('藤井聡太名人は第84期将棋名人戦で圧倒的な強さを見せた')
// 藤井聡太 フジイソウタ 名詞 ← NEologd proper noun
// 名人 メイジン 名詞
// は ハ 助詞
// 第 ダイ 接頭詞
// 84期 ハチジュウヨンキ 名詞 ← digit+counter compound
// 将棋 ショウギ 名詞
// 名人戦 メイジンセン 名詞
// で デ 助詞
// 圧倒的 アットウテキ 名詞
// な ナ 助動詞
// 強 ツヨ 形容詞
// さ サ 名詞
// を ヲ 助詞
// 見せ ミセ 動詞
// た タ 助動詞
```

## Benchmarks

All numbers were measured on an Apple M1 Pro with Node.js 22 and the NEologd dictionary (6.1M entries, 1.4 GB uncompressed). Methodology: 700 real-world Japanese news snippets × 9 conversion variants = 6,300 HTTP calls end-to-end through an Express service.
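
To collect comparable latency numbers on your own corpus, a small timing helper is enough. The following is a minimal sketch that times in-process calls rather than the HTTP round-trips used in the harness; `timeCalls`, `p50`, and `callsPerSecond` are hypothetical helper names and are not part of kusamoji.

```javascript
const { performance } = require('perf_hooks')

// Median (p50) of a list of numbers.
function p50(samples) {
  const sorted = [...samples].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Run fn(input) once per input and return the per-call latencies in milliseconds.
function timeCalls(fn, inputs) {
  return inputs.map((input) => {
    const start = performance.now()
    fn(input)
    return performance.now() - start
  })
}

// Sequential throughput implied by a given latency, in calls per second.
function callsPerSecond(latencyMs) {
  return 1000 / latencyMs
}
```

With a built tokenizer, `p50(timeCalls((text) => tokenizer.tokenize(text), snippets))` gives a median latency in milliseconds, where `snippets` is your own array of strings. A 5 ms median corresponds to 200 sequential calls per second.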
### Cold start

| Library      | Boot time | How the dictionary is loaded                        |
| ------------ | --------: | --------------------------------------------------- |
| **kusamoji** | **1.0 s** | Dictionary memory-mapped, OS demand-pages on access |
| kuromoji.js  |    8–12 s | gunzip + parse all 12 .dat.gz files                 |

### Runtime memory (RSS)

| Library      |   Idle RSS | Under load (700 concurrent) |     Peak |
| ------------ | ---------: | --------------------------: | -------: |
| **kusamoji** | **1.4 GB** |                      2.2 GB |   3.1 GB |
| kuromoji.js  |     6–8 GB |                       8+ GB | OOM risk |

With mmap, the ~1.4 GB dictionary sits in the OS page cache, **not V8 heap**. Under memory pressure the OS evicts cold pages automatically. V8's garbage collector never sees the dictionary data.

### Tokenization throughput

| Input                      | Tokens/call | Latency (p50) |    Throughput |
| -------------------------- | ----------: | ------------: | ------------: |
| Short sentence (10 chars)  |          ~5 |    **0.3 ms** | 3,300 calls/s |
| News headline (50 chars)   |         ~20 |    **0.8 ms** | 1,250 calls/s |
| News article (500 chars)   |        ~150 |      **5 ms** |   200 calls/s |
| Long article (2,000 chars) |        ~600 |     **18 ms** |    55 calls/s |

### Accuracy (6,300-call harness)

700 real-world news snippets from Yahoo News Japan, NHK, and Mainichi — mixed content with ASCII brand names, URLs, numbers, brackets, and quoted English. The input snippets are available at [Kusamoji Test News Snippets](https://github.com/KimuraRisei/kusamoji-test-news-snippets).

| Metric                              | Score                                    |
| ----------------------------------- | ---------------------------------------- |
| Romaji conversion (5 systems × 700) | **99.0%** kanji-free output              |
| Kana conversion (4 modes × 700)     | **99.0%** kanji-free output              |
| Jukujikun (熟字訓) accuracy         | **48 / 49** tested compounds             |
| Proper noun accuracy (NEologd)      | **10 / 10** (大谷翔平, 宮崎駿, etc.)     |
| Place name accuracy                 | **10 / 10** (東京, 鹿児島, 秋葉原, etc.) |
| File descriptor leaks               | **0** after 6,300 calls                  |

### vs. alternatives

| Feature              | kusamoji                | kuromoji.js    | MeCab (C++)     | Sudachi (Java/Rust) |
| -------------------- | ----------------------- | -------------- | --------------- | ------------------- |
| Runtime              | Node.js                 | Node.js        | Native binary   | JVM / Native        |
| Dict loading         | **mmap (zero-copy)**    | gunzip to heap | mmap            | mmap (Rust)         |
| Boot time (NEologd)  | **~1 s**                | ~10 s          | ~0 s            | ~0.2 s              |
| RSS (NEologd)        | **~1.4 GB**             | ~6-8 GB        | ~0.5 GB         | ~0.2 GB             |
| Viterbi optimization | **Length bonus**        | None           | Cost estimation | CowArray            |
| POS source strategy  | **Pluggable (3 modes)** | In-heap only   | mmap            | mmap                |
| NEologd support      | ✅                      | ✅             | ✅              | ✅ (built-in)       |
| Node.js native       | ✅                      | ✅             | FFI required    | FFI required        |
| npm install          | ✅ `npm i kusamoji`     | ✅             | ❌              | ❌                  |
| Zero native deps     | ✅ (optional mmap)      | ✅             | N/A             | N/A                 |

> **Note:** MeCab and Sudachi achieve lower RSS because they are implemented in compiled languages with direct memory management. kusamoji's mmap addon brings Node.js RSS within 4× of native C++ — the closest any pure-npm Japanese tokenizer has gotten.

## API

### `kusamoji.builder(options)`

Returns a `TokenizerBuilder`.

| Option    | Type     | Required | Description                                          |
| --------- | -------- | -------- | ---------------------------------------------------- |
| `dicPath` | `string` | Yes      | Path to the directory containing the 12 `.dat` files |

### `builder.buildAsync()` → `Promise<Tokenizer>`

Loads the dictionary and returns a `Tokenizer` instance.

### `builder.build(callback)`

Callback-style variant: `callback(err, tokenizer)`.

### `tokenizer.tokenize(text)` → `Token[]`

Tokenizes input text.
Returns an array of tokens, each with this shape:

```js
{
  surface_form: "東京",        // as it appears in the text
  pos: "名詞",                 // part of speech
  pos_detail_1: "固有名詞",    // POS subcategory 1
  pos_detail_2: "地域",        // POS subcategory 2
  pos_detail_3: "一般",        // POS subcategory 3
  conjugated_type: "*",        // conjugation type
  conjugated_form: "*",        // conjugated form
  basic_form: "東京",          // dictionary form
  reading: "トウキョウ",       // reading in katakana
  pronunciation: "トーキョー", // pronunciation in katakana
  word_type: "KNOWN",          // "KNOWN" or "UNKNOWN"
}
```

Returns `[]` for `null`, `undefined`, or empty string input.

## Dictionary Files

kusamoji does NOT bundle a dictionary. You need 12 **uncompressed** `.dat` files compiled from IPADIC (or IPADIC-format compatible) CSV sources:

```
base.dat, check.dat, cc.dat, tid.dat, tid_map.dat, tid_pos.dat,
unk.dat, unk_char.dat, unk_compat.dat, unk_invoke.dat, unk_map.dat, unk_pos.dat
```

### Building a dictionary

Use the included build script with IPADIC CSV sources:

```bash
node node_modules/kusamoji/dict-source/build.mjs \
  --source /path/to/csv-sources \
  --output /path/to/output
```

The source directory must contain:

- `ipadic/` — base IPADIC CSV files + `matrix.def`, `char.def`, `unk.def`
- `custom/` — (optional) your own override entries

## License

[BSL 1.1](LICENSE) — free for personal and non-commercial use. Commercial use requires a license. Change date: 4 years from release.