UNPKG

@a-s8h/liblevenshtein

Version:

Various utilities regarding Levenshtein transducers.

157 lines (123 loc) 6.46 kB
# liblevenshtein ## CoffeeScript / JavaScript / Node.js ### A library for generating Finite State Transducers based on Levenshtein Automata. [![npm version][npm-version-badge]][npm-repo] [![Build Status][travis-ci-badge]][travis-ci] [![Join the chat at https://gitter.im/universal-automata/liblevenshtein-coffeescript][gitter-badge]][gitter-channel] Levenshtein transducers accept a query term and return all terms in a dictionary that are within n spelling errors away from it. They constitute a highly-efficient (space _and_ time) class of spelling correctors that work very well when you do not require context while making suggestions. Forget about performing a linear scan over your dictionary to find all terms that are sufficiently-close to the user's query, using a quadratic implementation of the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) or [Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance), these babies find _all_ the terms from your dictionary in linear time _on the length of the query term_ (not on the size of the dictionary, on the length of the query term). If you need context, then take the candidates generated by the transducer as a starting place, and plug them into whatever model you're using for context (such as by selecting the sequence of terms that have the greatest probability of appearing together). For a quick demonstration, please visit the [Github Page, here](http://universal-automata.github.io/liblevenshtein/). The library is currently written in Java, CoffeeScript, and JavaScript, but I will be porting it to other languages, soon. If you have a specific language you would like to see it in, or package-management system you would like it deployed to, let me know. ### Basic Usage: #### Node.js Install the module via `npm`: ``` % npm install liblevenshtein info trying registry request attempt 1 at 12:59:16 http GET https://registry.npmjs.org/liblevenshtein http 304 https://registry.npmjs.org/liblevenshtein liblevenshtein@2.0.4 node_modules/liblevenshtein ``` Then, you may `require` it to do whatever you need: ```javascript var levenshtein = require('liblevenshtein'); // Assume "completion_list" is a list of terms you want to match against in // fuzzy queries. var builder = new levenshtein.Builder() .dictionary(completion_list, false) // generate spelling candidates from unsorted completion_list .algorithm("transposition") // use Levenshtein distance extended with transposition .sort_candidates(true) // sort the spelling candidates before returning them .case_insensitive_sort(true) // ignore character-casing while sorting terms .include_distance(false) // just return the ordered terms (drop the distances) .maximum_candidates(10); // only want the top-10 candidates // Maximum number of spelling errors we will allow the spelling candidates to // have, with regard to the query term. var MAX_EDIT_DISTANCE = 2; var transducer = builder.build(); // Assume "term" corresponds to some query term. Once invoking // transducer.transduce(term, MAX_EDIT_DISTANCE), candidates will contain a list // of all spelling candidates from the completion list that are within // MAX_EDIT_DISTANCE units of error from the query term. var candidates = transducer.transduce(term, MAX_EDIT_DISTANCE); ``` #### In the Browser To use the library on your website, reference the desired file from the `<head/>` of your document, like so: ```html <!DOCTYPE html> <html> <head> <!-- stuff ... --> <script type="text/javascript" src="http://universal-automata.github.com/liblevenshtein/javascripts/2.0.4/levenshtein-transducer.min.js"> </script> <!-- more stuff ... --> </head> <body> <!-- yet another fancy document ... --> </body> </html> ``` Once the script loads, you should construct a transducer via the [Builder Api](http://universal-automata.github.io/liblevenshtein/docs/coffeescript/builder.html): ```javascript $(function ($) { "use strict"; // Maximum number of spelling errors we will allow the spelling candidates to // have, with regard to the query term. var MAX_EDIT_DISTANCE = 2; var completion_list = getCompletionList(); // fictitious method var builder = new levenshtein.Builder() .dictionary(completion_list, false) // generate spelling candidates from unsorted completion_list .algorithm("transposition") // use Levenshtein distance extended with transposition .sort_candidates(true) // sort the spelling candidates before returning them .case_insensitive_sort(true) // ignore character-casing while sorting terms .include_distance(false) // just return the ordered terms (drop the distances) .maximum_candidates(10); // only want the top-10 candidates var transducer = builder.build(); var $queryTerm = $('#query-term-input-field'); $queryTerm.keyup(function (event) { var candidates, term = $.trim($queryTerm.val()); if (term) { candidates = transducer.transduce(term, MAX_EDIT_DISTANCE); printAutoComplete(candidates); // print the list of completions } else { clearAutoComplete(); // user has cleared the search box } return true; }); }); ``` This will give the user autocompletion hints as he types in the search box. ### Reference This library is based largely on the work of [Stoyan Mihov](http://www.lml.bas.bg/~stoyan/), [Klaus Schulz](http://www.cis.uni-muenchen.de/people/schulz.html), and Petar Nikolaev Mitankin: "[Fast String Correction with Levenshtein-Automata](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 "Klaus Schulz and Stoyan Mihov (2002)")". For more details, please see the [wiki](https://github.com/universal-automata/liblevenshtein/wiki). [travis-ci-badge]: https://travis-ci.org/universal-automata/liblevenshtein-coffeescript.svg?branch=master [travis-ci]: https://travis-ci.org/universal-automata/liblevenshtein-coffeescript [npm-version-badge]: https://badge.fury.io/js/liblevenshtein.svg [npm-repo]: https://www.npmjs.com/package/liblevenshtein [gitter-badge]: https://badges.gitter.im/universal-automata/liblevenshtein-coffeescript.svg [gitter-channel]: https://gitter.im/universal-automata/liblevenshtein-coffeescript?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge