UNPKG

ukkonen

Version:

Ukkonens approximate string matching algorithm for finding edit distance similar to Levenshtein

104 lines (64 loc) 3.93 kB
# Ukkonen - Approximate String Matching [![npm version](https://badge.fury.io/js/ukkonen.svg)](https://badge.fury.io/js/ukkonen) [![Checks](https://github.com/sunesimonsen/ukkonen/actions/workflows/ci.yml/badge.svg)](https://github.com/sunesimonsen/ukkonen/actions/workflows/ci.yml) This project implements the [Approximate String Matching algorithm by Esko Ukkonen](https://www.sciencedirect.com/science/article/pii/S0019995885800462) extended with ideas from [An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorith by Hal Berghel and David Roach](http://berghel.net/publications/asm/asm.pdf). Ukkonen's algorithm is very competitive with the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and for longer strings it is much more performant than Levenshtein distance. In addition to being a competitive alternative to Levenshtein distance, Ukkonen's algorithm also allows you to provide a threshold for the distance which increases the performance even more for texts that are longer than the threshold. <img src="./images/leven-for-tree-matching.gif" alt="HTML diffing using Levenshtein" height="200"> <img src="./images/ukkonen-for-tree-matching.gif" alt="HTML diffing using Ukkonen's algorithm" height="200"> Above you can see the different of using Levenshtein distance and Ukkonen's algorithm for matching sub-trees when diffing HTML. ## Install ```sh npm install --save ukkonen ``` ## Usage You can find the distance between the strings `Ukkonen` and `Levenshtein` the following way: ```js var ukkonen = require("ukkonen"); assert.equal(ukkonen("Ukkonen", "Levenshtein"), 8); ``` If you want to limit the distance by a given threshold: ```js var ukkonen = require("ukkonen"); assert.equal(ukkonen("Ukkonen", "Levenshtein", 6), 6); assert.equal(ukkonen("Ukkonen", "Levenshtein", 10), 8); ``` ## Platform support The library is ES6 and will work with any JavaScript bundler in the browser as well as Node versions with ES6 support. ## Benchmark I have benchmarked the library against [the fastest Levenshtein distance implementation on NPM](https://github.com/sindresorhus/leven). ``` Running benchmarks with 1000 iterations # ukkonen: Edit distance one word (14 examples) ok ~18 ms (0 s + 17993165 ns) # leven: Edit distance one word (14 examples) ok ~13 ms (0 s + 13155407 ns) # ukkonen: Edit distance on sentence with small differences ok ~1.66 ms (0 s + 1656841 ns) # leven: Edit distance on sentence with small differences ok ~7.23 ms (0 s + 7233814 ns) # ukkonen: Edit distance on paragraphs with small differences ok ~5.37 ms (0 s + 5367561 ns) # leven: Edit distance on paragraphs with small differences ok ~416 ms (0 s + 416468504 ns) # ukkonen: Edit distance on longer texts with small differences ok ~10 ms (0 s + 10305586 ns) # leven: Edit distance on longer texts with small differences ok ~1.7 s (1 s + 703731130 ns) # ukkonen: Edit distance on longer texts with many differences ok ~3.28 s (3 s + 280166305 ns) # leven: Edit distance on longer texts with many differences ok ~2.52 s (2 s + 519432479 ns) # ukkonen: Edit distance on longer texts with small differences and a threshold of 20 ok ~9.69 ms (0 s + 9691021 ns) # leven: Edit distance on longer texts with small differences and a threshold of 20 ok ~1.61 s (1 s + 610079082 ns) # ukkonen: Edit distance on longer texts with many differences and a threshold of 40 ok ~15 ms (0 s + 15225792 ns) # leven: Edit distance on longer texts with many differences and a threshold of 40 ok ~2.54 s (2 s + 539519721 ns) ``` ## Acknowledgements Obviously the authors of the papers describing the algorithm Esko Ukkonen, Hal Berghel and David Roach. I stole a lot of ideas from [Sindre Sorhus](https://github.com/sindresorhus)'s [leven](https://github.com/sindresorhus/leven) library and I also used it to test my implementation against. ## License [MIT © Sune Simonsen](./LICENSE)