# node-simhash-mod
This version now uses 64-bit hashes. Warning: it is much slower.
A simple command line tool for comparing text files using the simhash algorithm and contrasting the result with the Jaccard index.
Almost pure fork of [node-simhash, by Scott Horn](https://github.com/sjhorn/node-simhash):
- Patches [log4js issue](https://github.com/sjhorn/node-simhash/issues/1) by setting a forced version of log4js
- Cleans French diacritics
- Adds a `getDistanceReport` helper function
## References
- [Near duplicate detection (moz.com)](https://moz.com/devblog/near-duplicate-detection/)
- [Near duplicate detection (Jonathan Koren)](https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a)
## Installation
### If you have just cloned this repository, run the following
````
npm install
npm link
````
## Command line tool usage
Using node
````
simhash file1.txt file2.txt
simhash https://file.com/page1.html https://file.com/page2.html
````
### Using lib
````js
var simhash = require('node-simhash-mod');
// string1 and string2 are the two texts to compare
simhash.compare(string1, string2);
````
### Methods
#### <a name="summary"></a>.summary(file1, file2)
Compare two text strings using both simhash and the Jaccard index, and print a summary.
#### <a name="compare"></a>.compare(file1, file2)
Compare two text strings using both simhash and the Jaccard index.
#### <a name="hammingWeight"></a>.hammingWeight(number)
Count the binary ones in a number.
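As a rough illustration of what counting binary ones involves, here is a minimal sketch for 32-bit unsigned integers. The names `popcount32` and `hammingDistance32` are hypothetical, and this is not necessarily how the library implements `.hammingWeight`.

````js
// Hypothetical helper, not the library's implementation:
// counts the set bits of a 32-bit unsigned integer.
function popcount32(n) {
  n = n >>> 0; // coerce to unsigned 32-bit
  var count = 0;
  while (n !== 0) {
    n = (n & (n - 1)) >>> 0; // clear the lowest set bit
    count++;
  }
  return count;
}

// The Hamming distance between two hashes is the weight of their XOR.
function hammingDistance32(a, b) {
  return popcount32((a ^ b) >>> 0);
}
````

Two hashes that differ in few bits have a small Hamming distance, which is what makes the weight of the XOR useful for near-duplicate detection.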
#### <a name="shingles"></a>.shingles(string, words_per_single=2)
Convert a string to a set of shingles, using a default of 2 words per shingle and tokenizing with the `natural` library's default tokenizer.
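To show what a set of word shingles looks like, here is a minimal sketch. It assumes plain whitespace tokenization in place of the `natural` tokenizer, and the name `shinglesSketch` is hypothetical, so the output may differ from the library's.

````js
// Hypothetical sketch: whitespace splitting stands in for the
// `natural` tokenizer that the library uses.
function shinglesSketch(text, wordsPerShingle) {
  var size = wordsPerShingle || 2;
  var words = text.split(/\s+/).filter(Boolean);
  var result = new Set();
  // Slide a window of `size` words over the token list.
  for (var i = 0; i + size <= words.length; i++) {
    result.add(words.slice(i, i + size).join(' '));
  }
  return result;
}
````

Under these assumptions, `shinglesSketch('the quick brown fox')` yields the three shingles `the quick`, `quick brown` and `brown fox`.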
#### <a name="jaccardIndex"></a>.jaccardIndex(string1, string2)
Compare two strings by tokenizing them into shingles and then comparing the size of the intersection of the shingle sets to the size of their union.
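The intersection-over-union computation can be sketched as follows. The helper names `toShingles` and `jaccardSketch` are hypothetical, whitespace tokenization is an assumption, and returning 0 when both sets are empty is a choice made for this sketch only.

````js
// Hypothetical sketch, not the library's implementation.
function toShingles(text, size) {
  var n = size || 2;
  var words = text.split(/\s+/).filter(Boolean);
  var set = new Set();
  for (var i = 0; i + n <= words.length; i++) {
    set.add(words.slice(i, i + n).join(' '));
  }
  return set;
}

function jaccardSketch(a, b) {
  var sa = toShingles(a);
  var sb = toShingles(b);
  var intersection = 0;
  sa.forEach(function (shingle) {
    if (sb.has(shingle)) intersection++;
  });
  var union = sa.size + sb.size - intersection;
  // Assumption: two empty shingle sets are given an index of 0.
  return union === 0 ? 0 : intersection / union;
}
````

For `'a b c d'` and `'a b c e'`, two of the four distinct shingles are shared, so the sketch returns 0.5.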
#### <a name="createBinaryString"></a>.createBinaryString(number)
Print a 32-bit number as a binary string of 32 characters
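A fixed-width binary rendering of this kind can be sketched in one line. The name `toBinaryString32` is hypothetical and the library's own implementation may differ.

````js
// Hypothetical sketch: render a number as exactly 32 binary digits.
function toBinaryString32(n) {
  // `>>> 0` reinterprets the value as an unsigned 32-bit integer,
  // so negative inputs show their two's-complement bit pattern.
  return (n >>> 0).toString(2).padStart(32, '0');
}
````

Padding to a fixed width keeps hashes aligned when they are printed one above the other, which makes differing bits easy to spot.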
#### <a name="shingleHashList"></a>.shingleHashList(set)
Convert a set of shingles to a set of CRC-32 hashes.
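For readers unfamiliar with CRC-32, the sketch below hashes each shingle with a plain bitwise CRC-32 (the common reflected polynomial `0xEDB88320`). The names `crc32Sketch` and `hashShingles` are hypothetical, and the library may well rely on a dependency rather than code like this.

````js
// Hypothetical sketch: bitwise CRC-32 over the UTF-8 bytes of a string.
function crc32Sketch(text) {
  var bytes = Buffer.from(text, 'utf8');
  var crc = 0xFFFFFFFF;
  for (var i = 0; i < bytes.length; i++) {
    crc ^= bytes[i];
    for (var k = 0; k < 8; k++) {
      // Shift right; XOR in the polynomial when the dropped bit was 1.
      crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0; // final inversion, as unsigned
}

// Map every shingle in the set to its hash.
function hashShingles(shingleSet) {
  var hashes = new Set();
  shingleSet.forEach(function (shingle) {
    hashes.add(crc32Sketch(shingle));
  });
  return hashes;
}
````

The table-free loop is slow compared with a lookup-table version, but it keeps the sketch short.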
#### Distance report
Often you have a list of strings and want to check how close they are to each other.
`getDistanceReport` will produce a JSON report containing, for each text, the closest ones.
Parameters are the following:
- an array of textual objects; each object must have a `text` property containing its string and a `simhash` property with the hash already calculated; feel free to add other properties, typically an ID
- the similarity threshold: if the similarity between two strings is greater than this threshold, the pair is added to the list of the closest ones; use 0.8, for instance, to report only texts whose similarity exceeds 0.8
- the maximum number of closest strings to include in the output (only the closest ones are returned)
The output is an array of objects:
- `for`: reference to the textual object
- `closestOnes`: an array with the closest elements; each object points to an element (`with` property) and gives the distance (`difference` property)
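
To make the input and output shapes concrete, the sketch below builds a report of this form in plain JavaScript. It is not the library's implementation: the name `distanceReportSketch`, the 32-bit integer hashes, the made-up sample values, and the definition of `difference` as the fraction of differing hash bits are all assumptions made for illustration.

````js
// Hypothetical sketch of the report shape, not the library's code.
// Assumes each `simhash` is a 32-bit unsigned integer.
function bitDifference(a, b) {
  var x = (a ^ b) >>> 0;
  var bits = 0;
  while (x !== 0) {
    x = (x & (x - 1)) >>> 0; // clear the lowest set bit
    bits++;
  }
  return bits / 32; // 0 means identical hashes
}

function distanceReportSketch(items, threshold, maxClosest) {
  return items.map(function (item) {
    var closestOnes = items
      .filter(function (other) { return other !== item; })
      .map(function (other) {
        return { with: other, difference: bitDifference(item.simhash, other.simhash) };
      })
      // Keep pairs whose similarity (1 - difference) exceeds the threshold.
      .filter(function (entry) { return 1 - entry.difference > threshold; })
      .sort(function (p, q) { return p.difference - q.difference; })
      .slice(0, maxClosest);
    return { for: item, closestOnes: closestOnes };
  });
}

// Made-up sample input: `id` is an extra property, as allowed above.
var texts = [
  { id: 'a', text: 'first text', simhash: 0x0000000F },
  { id: 'b', text: 'second text', simhash: 0x0000000E },
  { id: 'c', text: 'third text', simhash: 0xFFFF0000 }
];
var report = distanceReportSketch(texts, 0.8, 2);
````

With these sample hashes, `a` and `b` differ in a single bit and list each other as closest, while `c` ends up with an empty `closestOnes` array.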