superminhash
Version:
TypeScript implementation of the SuperMinHash algorithm for Jaccard similarity estimation
120 lines (80 loc) • 3.35 kB
Markdown
SuperMinHash is a probabilistic data structure for estimating the [Jaccard similarity index](https://en.wikipedia.org/wiki/Jaccard_index) between sets.
It provides a significant improvement over traditional MinHash by requiring less memory while maintaining comparable accuracy.
This is a TypeScript implementation based on the paper "[SuperMinHash – A New Minwise Hashing Algorithm for Jaccard Similarity Estimation](https://arxiv.org/pdf/1706.05698)" by Otmar Ertl.
```bash
npm install superminhash
```
```typescript
import { SuperMinHash } from 'superminhash';
// Create two sets
const set1 = ['apple', 'banana', 'orange', 'pear'];
const set2 = ['banana', 'orange', 'kiwi', 'grape'];
// Create MinHash signatures for both sets
const minhash1 = SuperMinHash.fromIterable(set1);
const minhash2 = SuperMinHash.fromIterable(set2);
// Calculate similarity (estimated Jaccard similarity)
const similarity = minhash1.similarity(minhash2);
console.log(`Similarity: ${similarity}`); // Expected ~0.3 (2/6)
```
```typescript
// Default parameters (signature size of 128, default seed)
const minhash = new SuperMinHash();
// Custom signature size (larger = more accurate but uses more memory)
const minhash = new SuperMinHash(256);
// Custom signature size and seed
const minhash = new SuperMinHash(128, 42);
// Create directly from an iterable
const minhash = SuperMinHash.fromIterable(['a', 'b', 'c']);
```
```typescript
// Add elements as an array
minhash.add(['apple', 'banana', 'orange']);
// Add elements from a Set
const fruitSet = new Set(['apple', 'banana', 'orange']);
minhash.add(fruitSet);
// Add elements incrementally
minhash.add(['apple']);
minhash.add(['banana']);
minhash.add(['orange']);
```
```typescript
// Compare two SuperMinHash instances
const similarity = minhash1.similarity(minhash2);
// Alternatively, use getJaccardIndex
const jaccardIndex = minhash1.getJaccardIndex(minhash2);
```
```typescript
// Serialize to a binary format (Uint8Array)
const binaryData = minhash.serialize();
// Store the binary data (e.g., in a database, file, etc.)
// ...
// Later, deserialize it back into a SuperMinHash instance
const restoredMinHash = SuperMinHash.deserialize(binaryData);
// Compare serialized signatures directly
const similarity = SuperMinHash.compareSerialized(binaryData1, binaryData2);
```
```typescript
// Get the underlying signature
const signature = minhash.getSignature(); // Returns Uint32Array
// Create a SuperMinHash from an existing signature
const newMinHash = SuperMinHash.fromRawSignature(signature, seed);
```
1. **Choose the right signature size**: Larger signatures give more accurate similarity estimates but use more memory.
- For most applications, 128-256 elements work well
- For higher precision, use 512-1024 elements
2. **Reuse seeds**: Always use the same seed when comparing signatures.
3. **Preprocess large inputs**: Consider preprocessing large inputs before adding them to SuperMinHash to reduce memory usage.
[](LICENSE)
Contributions are welcome! Please feel free to submit a Pull Request.