# node-simhash-mod

64-bit version now. Warning: much slower.

A simple command line tool for comparing text files using the simhash algorithm and contrasting with the Jaccard index.

Almost pure fork of [node-simhash, by Scott Horn](https://github.com/sjhorn/node-simhash):

- Patches the [log4js issue](https://github.com/sjhorn/node-simhash/issues/1) by forcing a specific version of log4js
- Cleans French diacritics
- Adds a `getDistanceReport` helper function

## References

- [Near duplicate detection (moz.com)](https://moz.com/devblog/near-duplicate-detection/)
- [Near duplicate detection (Jonathan Koren)](https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a)

## Installation

### If you have just cloned this repository, run the following

````
npm install
npm link
````

## Command line tool usage

Using node

````
simhash file1.txt file2.txt
simhash https://file.com/page1.html https://file.com/page2.html
````

### Using lib

````js
var simhash = require('node-simhash-mod');

// Two sample strings to compare
var string1 = 'the quick brown fox jumps over the lazy dog';
var string2 = 'the quick brown fox jumps over the lazy cat';

simhash.compare(string1, string2);
````

### Methods

#### <a name="summary"></a>.summary(file1, file2)

Compare two text strings using both simhash and the Jaccard index, and print a summary.

#### <a name="compare"></a>.compare(file1, file2)

Compare two text strings using both simhash and the Jaccard index.

#### <a name="hammingWeight"></a>.hammingWeight(number)

Count the binary ones in a number.

#### <a name="shingles"></a>.shingles(string, words_per_single=2)

Convert a string to a set of shingles, using 2 words per shingle by default. The string is tokenized with the `natural` library's default tokenizer.

#### <a name="jaccardIndex"></a>.jaccardIndex(string1, string2)

Compare two strings by tokenizing them, then comparing the intersection of their shingles to the union of their shingles.

#### <a name="createBinaryString"></a>.createBinaryString(number)

Print a 32-bit number as a binary string of 32 characters.

#### <a name="shingleHashList"></a>.shingleHashList(set)

Convert a set of shingles to a set of CRC-32 hashes.

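
To make the method descriptions above more concrete, here is a minimal, self-contained sketch of shingling, the Jaccard index, the Hamming weight and the binary string helper. It is an independent illustration and not the library's code: it assumes a plain whitespace tokenizer (the library uses the `natural` tokenizer), and the function bodies are assumptions that only mirror the documented behaviour.

````js
// Sketch only: independent re-implementation for illustration,
// not the code shipped in node-simhash-mod.

// Assumption: lowercase + whitespace split instead of the `natural` tokenizer.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Build the set of shingles (runs of consecutive words) for a string.
function shingles(text, wordsPerShingle = 2) {
  const words = tokenize(text);
  const result = new Set();
  for (let i = 0; i + wordsPerShingle <= words.length; i++) {
    result.add(words.slice(i, i + wordsPerShingle).join(' '));
  }
  return result;
}

// Jaccard index: size of the intersection divided by size of the union.
function jaccardIndex(string1, string2) {
  const a = shingles(string1);
  const b = shingles(string2);
  let intersection = 0;
  for (const s of a) {
    if (b.has(s)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}

// Count the binary ones in a 32-bit number by clearing the lowest set bit.
function hammingWeight(number) {
  let n = number >>> 0;
  let count = 0;
  while (n !== 0) {
    n &= n - 1;
    count++;
  }
  return count;
}

// Render a 32-bit number as a 32-character binary string.
function createBinaryString(number) {
  return (number >>> 0).toString(2).padStart(32, '0');
}

console.log(jaccardIndex('the quick brown fox', 'the quick brown dog')); // 0.5
console.log(hammingWeight(0b1011)); // 3
console.log(createBinaryString(5).length); // 32
````

In the first call each string yields three shingles, two of which are shared, so the intersection has 2 elements and the union has 4. The Hamming weight of the XOR of two hashes gives the number of bits in which they differ, which is how a Hamming distance between two simhashes is usually obtained.
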
#### Distance report

Often you have a list of strings and want to check how close they are to each other. `getDistanceReport` produces a JSON report containing, for each text, the closest ones.

Parameters are the following:

- an array of textual objects; each object **must** have a `text` property containing its string, and a `simhash` property with the hash already calculated; feel free to add other properties, typically an ID
- the maximal acceptable similarity: if the similarity between two strings is greater than this threshold, the pair is added to the list of the closest ones; for instance, use 0.8 to only trigger when texts are 80% different or less
- the maximum number of closest strings to include in the output (only the closest ones are given)

The output is an array of objects:

- `for`: reference to the textual object
- `closestOnes`: an array with the closest elements; each object points to an element (`with` property) and gives the distance (`difference` property)
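
The input and output shapes described above can be sketched as follows. This is a hypothetical stand-in, not the library's `getDistanceReport`: the function name `distanceReportSketch`, the `difference` callback, the equal-length binary-string hash format and the rule "keep a pair when its difference is at most the threshold" are all assumptions made for the example.

````js
// Sketch only: hypothetical helper that mimics the documented report shape.
// It is NOT the getDistanceReport implementation from node-simhash-mod.
function distanceReportSketch(items, maxDifference, maxClosest, difference) {
  return items.map((item) => {
    const closestOnes = items
      .filter((other) => other !== item)
      .map((other) => ({ with: other, difference: difference(item, other) }))
      // Assumption: a pair is kept when its difference is at most the threshold.
      .filter((entry) => entry.difference <= maxDifference)
      // Closest first, then keep only the requested number of entries.
      .sort((a, b) => a.difference - b.difference)
      .slice(0, maxClosest);
    return { for: item, closestOnes };
  });
}

// Assumption for this example: hashes are equal-length binary strings,
// and the difference is the fraction of bit positions that differ.
function bitDifference(a, b) {
  let differing = 0;
  for (let i = 0; i < a.simhash.length; i++) {
    if (a.simhash[i] !== b.simhash[i]) differing++;
  }
  return differing / a.simhash.length;
}

// Made-up sample data: `id` is an extra property, as the text allows.
const texts = [
  { id: 'a', text: 'first text', simhash: '11110000' },
  { id: 'b', text: 'second text', simhash: '11110001' },
  { id: 'c', text: 'third text', simhash: '00001111' },
];

const report = distanceReportSketch(texts, 0.5, 2, bitDifference);
console.log(JSON.stringify(report, null, 2));
````

With this sample data, `a` and `b` differ in one bit out of eight, so each lists the other in `closestOnes`, while `c` is too far from both and gets an empty list.
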