# cc-cedict-parser
### Installation
- Clone this repository
- Install dependencies with `npm install` from the project directory
### Usage
- Download the latest CC-CEDICT release from [https://www.mdbg.net/chinese/dictionary?page=cedict](https://www.mdbg.net/chinese/dictionary?page=cedict)
- Unzip it
- Run this:
```bash
node ./lib/parser.js --input ./cedict_ts.u8 --output ./data/cedict.sqlite
```
#### OR
```bash
npm run build
```
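For reference, each data line in the downloaded `cedict_ts.u8` file follows the CC-CEDICT format `Traditional Simplified [pinyin] /definition 1/definition 2/`. Here's a minimal sketch of pulling those pieces out of a line (illustrative only — `parseLine` is a made-up name, not the actual code in `./lib/parser.js`):

```javascript
// Sketch of parsing a single CC-CEDICT entry line.
// Format: Traditional Simplified [pin1 yin1] /def 1/def 2/
// (Illustrative only; not the real ./lib/parser.js implementation.)
function parseLine (line) {
  const match = line.match(/^(\S+) (\S+) \[([^\]]+)\] \/(.+)\/$/)
  if (!match) return null // comment line or malformed entry
  const [, traditional, simplified, pinyin, defs] = match
  return { traditional, simplified, pinyin, definitions: defs.split('/') }
}

console.log(parseLine('國 国 [guo2] /country/nation/state/national/CL:個|个[ge4]/'))
```

Comment lines in the source file start with `#`, so the sketch just returns `null` for anything that doesn't match the entry shape.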
### Yeah, it's sluggish
Parsing the whole dictionary takes a while. To avoid waiting around for this all day, point `--output` at some kind of RAM disk so the SQLite writes are all done in memory rather than hitting your disk. Just make sure you move the completed database into a 'real' directory after it's been created, otherwise it'll disappear at reboot (this is basically what the "build" script does).
#### Linux
If you're running some variant of Linux you probably have access to the [`/dev/shm`](https://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html) directory. Use it :)
#### OSX
If you're running OSX, there's a nifty guide to creating/mounting a RAMDisk [here, on StackOverflow](https://stackoverflow.com/a/2033417/742129).
TL;DR - Creating a 100MB RAMDisk
```bash
$ hdiutil attach -nomount ram://$((2 * 1024 * 100))
/dev/disk3
$ diskutil eraseVolume HFS+ RAMDisk /dev/disk3
Started erase on disk3
Unmounting disk
Erasing
Initialized /dev/rdisk3 as a 100 MB case-insensitive HFS Plus volume
Mounting disk
Finished erase on disk3 RAMDisk
```
Be sure that the last argument to the `diskutil` command is the device node printed by the first command (`/dev/disk3` above), otherwise you risk messing up your other partitions :)
Open Finder and you should see the new RAMDisk in the list of 'Locations'; from the command line, find it at `/Volumes/RAMDisk`.
#### Windows
¯\\\_(ツ)\_/¯
You're on your own
### Future
#### TODO (Maybe)
- [ ] It might be nice to store word frequency to order results in a potentially more useful manner - See [SUBTLEX-CH](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729), [Jun Da's Frequency statistics](http://lingua.mtsu.edu/chinese-computing/statistics/bigram/form.php).
- [x] The [Unihan Database](http://www.unicode.org/charts/unihan.html) contains some interesting character data (stroke count, radical data, character variations, some characters not in CC-CEDICT, some character parts not in CC-CEDICT) as well that could be brought in.
#### Improvements
There are a few optimisations that could be made here, though I'm not sure they'd be worth the effort.
- [ ] Removing `n-readlines` as a dependency and instead loading the entire file into memory shaved around a minute off the processing time in my (rather unscientific) tests.
- [ ] Batching the SQLite inserts _may_ help improve throughput, though if I add back-tracking to merge entries like the 國 example below, those batches will add complexity/mess things up.
- [x] Merge "duplicate" entries like the below:
```
國 国 [Guo2] /surname Guo/
國 国 [guo2] /country/nation/state/national/CL:個|个[ge4]/
```
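One way that merge could work (a hedged sketch, not the code actually used in this repo): group parsed entries by their traditional/simplified pair, keeping one record per headword with each reading's definitions attached:

```javascript
// Sketch of merging "duplicate" entries that share the same
// traditional/simplified pair (illustrative only; field names assumed).
function mergeEntries (entries) {
  const byHeadword = new Map()
  for (const entry of entries) {
    const key = `${entry.traditional}|${entry.simplified}`
    const existing = byHeadword.get(key)
    if (existing) {
      // Same headword seen before: collect this reading alongside the others.
      existing.readings.push({ pinyin: entry.pinyin, definitions: entry.definitions })
    } else {
      byHeadword.set(key, {
        traditional: entry.traditional,
        simplified: entry.simplified,
        readings: [{ pinyin: entry.pinyin, definitions: entry.definitions }]
      })
    }
  }
  return [...byHeadword.values()]
}
```

Run over the two 國 lines above, this would yield a single 國/国 record with two readings (`Guo2` for the surname, `guo2` for "country"), which is roughly the shape a dictionary UI wants anyway.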