# DistanceSensitiveHash - Lightning-Fast Euclidean Distance Search
[distance-sensitive-hash on npm](https://nodei.co/npm/distance-sensitive-hash/)
**Find similar items in microseconds.** This library compresses high-dimensional vectors (embeddings, features, user preferences) into tiny fingerprints that let you estimate Euclidean similarity **~9.5× faster** while using **up to 97% less memory** than naive full-vector comparison. Perfect for pre-filtering large candidate sets and for memory-constrained edge deployments.
## What's the Problem?
You have 100,000 high-dimensional vectors and need to find similar ones fast. The same trick works for anything vector-shaped:
- Product feature vectors
- Image / text embeddings
- User preference vectors
- Audio fingerprints
- Time series data
## How It Works (30-Second Version)
1. Feed your vector into `encodeE2LSH()` → get a compact integer signature (16-512 integers)
2. Compare signatures with `estimateSimilarity()` → Euclidean similarity in microseconds
3. Profit
Trade-off: 5-15% error for **orders of magnitude** speed-up and **massive memory savings**.
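The library's internals aren't shown here, but the recipe above can be sketched in a few lines. This is a minimal self-contained illustration of the classic E2LSH hash `h(v) = floor((a·v + b) / w)` plus a match-fraction similarity estimate; the function names (`encode`, `estimate`, `mulberry32`) are hypothetical and are **not** the package's API:

```javascript
// Small seeded PRNG (mulberry32) so fingerprints are reproducible.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller: turn two uniforms into one standard normal draw.
function gaussian(rand) {
  const u = 1 - rand(); // avoid log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rand());
}

// One E2LSH hash per signature slot: project onto a Gaussian vector,
// shift by a random offset b in [0, w), quantize into width-w buckets.
function encode(vector, bits, seed = 42, w = 4.0) {
  const rand = mulberry32(seed); // same seed => same projections for all inputs
  const sig = new Int32Array(bits);
  for (let i = 0; i < bits; i++) {
    let dot = 0;
    for (let j = 0; j < vector.length; j++) dot += gaussian(rand) * vector[j];
    const b = rand() * w;
    sig[i] = Math.floor((dot + b) / w);
  }
  return sig;
}

// Fraction of matching buckets -> rough similarity in [0, 1].
function estimate(fpA, fpB) {
  let same = 0;
  for (let i = 0; i < fpA.length; i++) if (fpA[i] === fpB[i]) same++;
  return same / fpA.length;
}
```

Nearby vectors land in the same bucket for most slots (estimate near 1); distant vectors rarely collide (estimate near 0).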
## Bundle Size & Performance
**Tiny footprint**: Core library is only **~8KB** (gzipped: ~3KB).
**Works everywhere**:
- ✅ **Node.js** (v14+)
- ✅ **Browser** (ES modules)
- ✅ **Web Workers** (thread-safe)
## Quick Start
```bash
npm install distance-sensitive-hash
```
```javascript
import { encodeE2LSH, estimateSimilarity } from 'distance-sensitive-hash';
const productA = [0.8, 0.2, 0.9, /*...512 dims...*/ 0.1, 0.7];
const productB = [0.7, 0.3, 0.8, /*...512 dims...*/ 0.2, 0.6];
// 1. Hash once (64 integers)
const fpA = encodeE2LSH(productA, 64);
const fpB = encodeE2LSH(productB, 64);
// 2. Compare in <1 μs
const sim = estimateSimilarity(fpA, fpB);
console.log('Similarity ≈', sim); // 0.0 – 1.0
```
## Two Algorithms, One API
| Name | Function | When to Pick |
|---|---|---|
| **CS-E2LSH** (default) | `encodeCSE2LSH()` | **3× faster encoding** (recommended) |
| **HCS-E2LSH** | `encodeHCSE2LSH()` | **Better accuracy for medium distances** |
Pick based on your use case: CS-E2LSH for faster encoding, HCS-E2LSH for tighter estimates at medium and far distances (see the accuracy table below).
## API Cheatsheet
```javascript
encodeE2LSH(vector, bits, seed = 42, variant = 'cs', w = 4.0, order = 2)
// vector: number[] – your data
// bits: int – signature size (16-512 recommended)
// seed: int – reproducibility
// variant: string – 'cs' or 'hcs'
// w: float – width parameter (4.0 default)
// order: int – tensor order for HCS (2-4)
// returns: Int32Array – compact fingerprint
estimateSimilarity(fpA, fpB, w = 4)
// fpA, fpB: Int32Array – fingerprints
// w: float – width parameter used in encoding
// returns: float 0-1 – Euclidean similarity estimate
```
## Benchmarks
| Task | Ops/sec | vs Vanilla | Accuracy |
|---|---|---|---|
| **CS-E2LSH encode** (512d→64-int) | 66,701 sigs/sec | — | 85-95% |
| **CS-E2LSH compare** | 8.7M ops/sec | **9.5× faster** | ±0.15 @ ≥0.7 |
| **HCS-E2LSH encode** (512d→64-int) | 7,857 sigs/sec | — | 85-95% |
| **HCS-E2LSH compare** | 8.8M ops/sec | **9.6× faster** | ±0.13 @ ≥0.7 |
| Vanilla Euclidean | 915k ops/sec | baseline | exact |
### Performance by Signature Size
| Size | Generation | Comparison | Memory | Accuracy |
|------|------------|------------|--------|----------|
| 16 | 50.8k/s | 9.3M/s | 64B | ±0.10 |
| 32 | 81.5k/s | 14.6M/s | 128B | ±0.13 |
| 64 | 66.7k/s | 8.7M/s | 256B | ±0.15 |
| 128 | 46.4k/s | 4.0M/s | 512B | ±0.13 |
| 256 | 15.5k/s | 2.0M/s | 1KB | ±0.09 |
### Distance Accuracy Examples
| True Distance | CS-E2LSH Error | HCS-E2LSH Error |
|---------------|----------------|-----------------|
| 0.377 (close) | ±0.064 (17%) | ±0.171 (45%) |
| 1.405 (medium)| ±0.543 (39%) | ±0.318 (23%) |
| 3.443 (far) | ±0.466 (14%) | ±0.231 (7%) |
## Memory Savings
**Massive reduction** compared to full vector storage:
| Vector Size | Full Vector | 64-int Signature | Reduction |
|-------------|-------------|------------------|-----------|
| 512-d float | 2,048 bytes | 256 bytes | **8× smaller** |
| 1024-d float | 4,096 bytes | 256 bytes | **16× smaller** |
| 2048-d float | 8,192 bytes | 256 bytes | **32× smaller** |
**Real-world impact:**
- **1M product vectors** (512-d): 2GB → 256MB (**87.5% reduction**)
- **10M user vectors** (1024-d): 40GB → 2.5GB (**93.75% reduction**)
- **100M image vectors** (2048-d): 800GB → 25GB (**96.9% reduction**)
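The reduction factors above are plain arithmetic: a d-dimensional float32 vector costs 4d bytes, while a 64-entry `Int32Array` signature is always 256 bytes. A quick sketch to reproduce the table (illustrative helper, not part of the library):

```javascript
// Bytes for a float32 vector of `dims` dimensions vs. a fixed-size
// signature of `sigInts` Int32 entries (4 bytes each).
function reductionFactor(dims, sigInts = 64) {
  const fullBytes = dims * 4;    // float32 per dimension
  const sigBytes = sigInts * 4;  // 256 B for a 64-int signature
  return fullBytes / sigBytes;
}

for (const dims of [512, 1024, 2048]) {
  console.log(`${dims}-d float32: ${dims * 4} B -> 256 B (${reductionFactor(dims)}x smaller)`);
}
```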
## When to Use / Not Use
✅ **Use for:**
- Large datasets (≥10k items)
- Real-time recommendations
- Memory-constrained environments
- Approximate nearest neighbor search
- Pre-filtering before exact comparisons
❌ **Don't use for:**
- Exact similarity requirements
- Very small datasets (<100 items)
- Cosine similarity (use [SimilaritySensitiveHash](https://npm.im/similarity-sensitive-hash))
- Jaccard similarity (use [Grouped-OPH](https://npm.im/grouped-oph))
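The "pre-filtering before exact comparisons" use case is typically a two-stage search: rank everything by the cheap signature estimate, then run exact Euclidean distance only on a shortlist. A hedged sketch of that glue code (helper names are made up for illustration; `estimates[i]` stands in for a per-item fingerprint-similarity score such as the library's `estimateSimilarity` output):

```javascript
// Exact Euclidean distance, used only on the shortlist.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Stage 1: keep the `shortlist` items with the best estimated similarity.
// Stage 2: re-rank that shortlist exactly and return the top `k`.
function topKExact(query, vectors, estimates, shortlist = 10, k = 3) {
  const candidates = estimates
    .map((sim, i) => ({ i, sim }))
    .sort((a, b) => b.sim - a.sim) // best estimate first
    .slice(0, shortlist);
  return candidates
    .map(({ i }) => ({ i, dist: euclidean(query, vectors[i]) }))
    .sort((a, b) => a.dist - b.dist) // exact re-ranking
    .slice(0, k);
}
```

The cheap stage touches all N items at millions of comparisons per second; the exact stage touches only the shortlist, so the 5-15% estimation error rarely changes the final top-k.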
## Academic Roots
Implements **Count-Sketch E2LSH (CS-E2LSH)** and **Higher-Order Count-Sketch E2LSH (HCS-E2LSH)** from:
> [Verma & Pratap, *Improving E2LSH with Count-Sketch and Higher-Order Count-Sketch*, arXiv:2503.06737](https://arxiv.org/abs/2503.06737)
Built on the classic LSH framework (Gionis, Indyk, Motwani, VLDB 1999) and Euclidean LSH (Datar et al., SoCG 2004).
## License
MIT