chinese-tokenizer
Version:
Simple algorithm to tokenize Chinese texts into words using CC-CEDICT.
90 lines (72 loc) • 2.37 kB
Markdown
# chinese-tokenizer [](https://travis-ci.org/yishn/chinese-tokenizer)
Simple algorithm to tokenize Chinese texts into words using [CC-CEDICT](https://cc-cedict.org/). You can try it out at [the demo page](https://yishn.github.io/chinese-tokenizer/). The code for the demo page can be found in the [`gh-pages` branch](https://github.com/yishn/chinese-tokenizer/tree/gh-pages) of this repository.
## How this works
This tokenizer uses a simple greedy algorithm: It always looks for the longest word in the CC-CEDICT dictionary that matches the input, one at a time.
## Installation
Use npm to install:
~~~
npm install chinese-tokenizer --save
~~~
## Usage
Make sure to provide the [CC-CEDICT](https://cc-cedict.org/) data.
~~~js
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')
console.log(JSON.stringify(tokenize('我是中国人。'), null, ' '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, ' '))
~~~
Output:
~~~js
[
{
"text": "我",
"traditional": "我",
"simplified": "我",
"position": { "offset": 0, "line": 1, "column": 1 },
"matches": [
{
"pinyin": "wo3",
"pinyinPretty": "wǒ",
"english": "I/me/my"
}
]
},
{
"text": "是",
"traditional": "是",
"simplified": "是",
"position": { "offset": 1, "line": 1, "column": 2 },
"matches": [
{
"pinyin": "shi4",
"pinyinPretty": "shì",
"english": "is/are/am/yes/to be"
}
]
},
{
"text": "中國人",
"traditional": "中國人",
"simplified": "中国人",
"position": { "offset": 2, "line": 1, "column": 3 },
"matches": [
{
"pinyin": "Zhong1 guo2 ren2",
"pinyinPretty": "Zhōng guó rén",
"english": "Chinese person"
}
]
},
{
"text": "。",
"traditional": "。",
"simplified": "。",
"position": { "offset": 5, "line": 1, "column": 6 },
"matches": []
}
]
~~~
## API
### `loadFile(path)`
Reads the [CC-CEDICT](https://cc-cedict.org/) file from given `path` and returns a tokenize function based on the dictionary.
### `load(content)`
Parses [CC-CEDICT](https://cc-cedict.org/) string content from `content` and returns a tokenize function based on the dictionary.