# cc-cedict-parser
### Installation
- Clone this repository
- Install dependencies with `npm install` from the project directory
### Usage
- Download the latest CC-CEDICT release from [https://www.mdbg.net/chinese/dictionary?page=cedict](https://www.mdbg.net/chinese/dictionary?page=cedict)
- Unzip it
- Run this:
```bash
node ./lib/parser.js --input ./cedict_ts.u8 --output ./data/cedict.sqlite
```
#### OR
```bash
npm run build
```
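For reference, each data line in the downloaded `cedict_ts.u8` file follows the CC-CEDICT format `Traditional Simplified [pinyin] /definition 1/definition 2/`. Here's a minimal sketch of pulling those pieces out of a line (illustrative only — `parseLine` is a made-up name, not the actual code in `./lib/parser.js`):

```javascript
// Sketch of parsing a single CC-CEDICT entry line.
// Format: Traditional Simplified [pin1 yin1] /def 1/def 2/
// (Illustrative only; not the real ./lib/parser.js implementation.)
function parseLine (line) {
  const match = line.match(/^(\S+) (\S+) \[([^\]]+)\] \/(.+)\/$/)
  if (!match) return null // comment line or malformed entry
  const [, traditional, simplified, pinyin, defs] = match
  return { traditional, simplified, pinyin, definitions: defs.split('/') }
}

console.log(parseLine('國 国 [guo2] /country/nation/state/national/CL:個|个[ge4]/'))
```

Comment lines in the source file start with `#`, so the sketch just returns `null` for anything that doesn't match the entry shape.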
### Yeah, it's sluggish
Parsing the whole dictionary takes a while. To avoid waiting around for this all day, point `--output` at some kind of RAM disk so the SQLite writes are all done in memory rather than hitting your disk. Just make sure you move the completed database into a 'real' directory after it's been created, otherwise it'll disappear at reboot (this is basically what the "build" script does).
#### Linux
If you're running some variant of Linux you probably have access to the [`/dev/shm`](https://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html) directory. Use it :)
#### OSX
If you're running OSX, there's a nifty guide to creating/mounting a RAMDisk [here, on StackOverflow](https://stackoverflow.com/a/2033417/742129).
TL;DR - Creating a 100MB RAMDisk
```bash
$ hdiutil attach -nomount ram://$((2 * 1024 * 100))
/dev/disk3
$ diskutil eraseVolume HFS+ RAMDisk /dev/disk3
Started erase on disk3
Unmounting disk
Erasing
Initialized /dev/rdisk3 as a 100 MB case-insensitive HFS Plus volume
Mounting disk
Finished erase on disk3 RAMDisk
```
Be sure that the last argument to the `diskutil` command is the device node printed by the first command (`/dev/disk3` above), otherwise you risk messing up your other partitions :)
Open Finder and you should see the new RAMDisk in the list of 'Locations'; from the command line, find it at `/Volumes/RAMDisk`.
#### Windows
¯\\\_(ツ)\_/¯
You're on your own
### Future
#### TODO (Maybe)
- [ ] It might be nice to store word frequency to order results in a potentially more useful manner - See [SUBTLEX-CH](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729), [Jun Da's Frequency statistics](http://lingua.mtsu.edu/chinese-computing/statistics/bigram/form.php).
- [x] The [Unihan Database](http://www.unicode.org/charts/unihan.html) contains some interesting character data (stroke count, radical data, character variations, some characters not in CC-CEDICT, some character parts not in CC-CEDICT) as well that could be brought in.
#### Improvements
There are a few optimisations that could be made here, though I'm not sure they'd be worth the effort.
- [ ] Removing `n-readlines` as a dependency and instead loading the entire file into memory shaved around a minute off the processing time in my (rather unscientific) tests.
- [ ] Batching the SQLite inserts _may_ help improve throughput, though if I add back-tracking to merge entries like the 國 example below, those batches will add complexity/mess things up.
- [x] Merge "duplicate" entries like the below:
```
國 国 [Guo2] /surname Guo/
國 国 [guo2] /country/nation/state/national/CL:個|个[ge4]/
```
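One way that merge could work (a hedged sketch, not the code actually used in this repo): group parsed entries by their traditional/simplified pair, keeping one record per headword with each reading's definitions attached:

```javascript
// Sketch of merging "duplicate" entries that share the same
// traditional/simplified pair (illustrative only; field names assumed).
function mergeEntries (entries) {
  const byHeadword = new Map()
  for (const entry of entries) {
    const key = `${entry.traditional}|${entry.simplified}`
    const existing = byHeadword.get(key)
    if (existing) {
      // Same headword seen before: collect this reading alongside the others.
      existing.readings.push({ pinyin: entry.pinyin, definitions: entry.definitions })
    } else {
      byHeadword.set(key, {
        traditional: entry.traditional,
        simplified: entry.simplified,
        readings: [{ pinyin: entry.pinyin, definitions: entry.definitions }]
      })
    }
  }
  return [...byHeadword.values()]
}
```

Run over the two 國 lines above, this would yield a single 國/国 record with two readings (`Guo2` for the surname, `guo2` for "country"), which is roughly the shape a dictionary UI wants anyway.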