# DistanceSensitiveHash - Lightning-Fast Euclidean Distance Search

[![npm package](https://nodei.co/npm/distance-sensitive-hash.png?downloads=true&stars=true)](https://nodei.co/npm/distance-sensitive-hash/)

**Find similar items in microseconds.** This library compresses high-dimensional vectors (embeddings, features, user preferences) into tiny fingerprints that let you estimate Euclidean similarity **9× faster** while using **94% less memory** than naive Euclidean distance comparison. Perfect for rapidly narrowing a candidate set, and for edge deployments.

## What's the Problem?

You have 100,000 high-dimensional vectors and need to find similar ones fast. Works for anything vector-shaped:

- Product feature vectors
- Image / text embeddings
- User preference vectors
- Audio fingerprints
- Time series data

## How It Works (30-Second Version)

1. Feed your vector into `encodeE2LSH()` → get a compact integer signature (16-512 integers)
2. Compare signatures with `estimateSimilarity()` → Euclidean similarity in microseconds
3. Profit

Trade-off: 5-15% error for **orders of magnitude** speed-up and **massive memory savings**.

## Bundle Size & Performance

**Tiny footprint**: Core library is only **~8KB** (gzipped: ~3KB).

**Works everywhere**:

- **Node.js** (v14+)
- **Browser** (ES modules)
- **Web Workers** (thread-safe)

## Quick Start

```bash
npm install distance-sensitive-hash
```

```javascript
import { encodeE2LSH, estimateSimilarity } from 'distance-sensitive-hash';

const productA = [0.8, 0.2, 0.9, /*...512 dims...*/ 0.1, 0.7];
const productB = [0.7, 0.3, 0.8, /*...512 dims...*/ 0.2, 0.6];

// 1. Hash once (64 integers)
const fpA = encodeE2LSH(productA, 64);
const fpB = encodeE2LSH(productB, 64);
// 2. Compare in <1 μs
const sim = estimateSimilarity(fpA, fpB);
console.log('Similarity ≈', sim); // 0.0 – 1.0
```

## Two Algorithms, One API

| Name | Function | When to Pick |
|---|---|---|
| **CS-E2LSH** (default) | `encodeCSE2LSH()` | **3× faster encoding** (recommended) |
| **HCS-E2LSH** | `encodeHCSE2LSH()` | **Better accuracy for medium distances** |

Pick based on your use case: CS for speed, HCS for better distance estimation.

## API Cheatsheet

```javascript
encodeE2LSH(vector, bits, seed = 42, variant = 'cs', w = 4.0, order = 2)
// vector: number[] – your data
// bits: int – signature size (16-512 recommended)
// seed: int – reproducibility
// variant: string – 'cs' or 'hcs'
// w: float – width parameter (4.0 default)
// order: int – tensor order for HCS (2-4)
// returns: Int32Array – compact fingerprint

estimateSimilarity(fpA, fpB, w = 4)
// fpA, fpB: Int32Array – fingerprints
// w: float – width parameter used in encoding
// returns: float 0-1 – Euclidean similarity estimate
```

## Benchmarks

| Task | Ops/sec | vs Vanilla | Accuracy |
|---|---|---|---|
| **CS-E2LSH encode** (512d→64-int) | 66,701 sigs/sec | — | 85-95% |
| **CS-E2LSH compare** | 8.7M ops/sec | **9.5× faster** | ±0.15 @ ≥0.7 |
| **HCS-E2LSH encode** (512d→64-int) | 7,857 sigs/sec | — | 85-95% |
| **HCS-E2LSH compare** | 8.8M ops/sec | **9.6× faster** | ±0.13 @ ≥0.7 |
| Vanilla Euclidean | 915k ops/sec | baseline | exact |

### Performance by Signature Size

| Size | Generation | Comparison | Memory | Accuracy |
|------|------------|------------|--------|----------|
| 16 | 50.8k/s | 9.3M/s | 64B | ±0.10 |
| 32 | 81.5k/s | 14.6M/s | 128B | ±0.13 |
| 64 | 66.7k/s | 8.7M/s | 256B | ±0.15 |
| 128 | 46.4k/s | 4.0M/s | 512B | ±0.13 |
| 256 | 15.5k/s | 2.0M/s | 1KB | ±0.09 |

### Distance Accuracy Examples

| True Distance | CS-E2LSH Error | HCS-E2LSH Error |
|---------------|----------------|-----------------|
| 0.377 (close) | ±0.064 (17%) | ±0.171 (45%) |
| 1.405 (medium)| ±0.543 (39%) | ±0.318 (23%) |
| 3.443 (far) | ±0.466 (14%) | ±0.231 (7%) |

## Memory Savings

**Massive reduction** compared to full vector storage:

| Vector Size | Full Vector | 64-int Signature | Reduction |
|-------------|-------------|------------------|-----------|
| 512-d float | 2,048 bytes | 256 bytes | **8× smaller** |
| 1024-d float | 4,096 bytes | 256 bytes | **16× smaller** |
| 2048-d float | 8,192 bytes | 256 bytes | **32× smaller** |

**Real-world impact:**

- **1M product vectors** (512-d): 2GB → 256MB (**87.5% reduction**)
- **10M user vectors** (1024-d): 40GB → 2.5GB (**93.75% reduction**)
- **100M image vectors** (2048-d): 800GB → 25GB (**96.9% reduction**)

## When to Use / Not Use

✅ **Use for:**

- Large datasets (≥10k items)
- Real-time recommendations
- Memory-constrained environments
- Approximate nearest neighbor search
- Pre-filtering before exact comparisons

❌ **Don't use for:**

- Exact similarity requirements
- Very small datasets (<100 items)
- Cosine similarity (use [SimilaritySensitiveHash](https://npm.im/similarity-sensitive-hash))
- Jaccard similarity (use [Grouped-OPH](https://npm.im/grouped-oph))

## Academic Roots

Implements **Count-Sketch E2LSH (CS-E2LSH)** and **Higher-Order Count-Sketch E2LSH (HCS-E2LSH)** from:

> [Verma & Pratap, *Improving E2LSH with Count-Sketch and Higher-Order Count-Sketch*, arXiv:2503.06737](https://arxiv.org/abs/2503.06737)

Built on the classic LSH framework (Gionis, Indyk, Motwani, VLDB 1999) and Euclidean LSH (Datar et al., SoCG 2004).

## License

MIT
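For readers curious how a bucketed signature can encode Euclidean distance at all, the classic Euclidean LSH projection from Datar et al. (cited above) — `floor((a · v + b) / w)` with Gaussian `a` and uniform offset `b` — can be sketched in a few lines of dependency-free JavaScript. This is a toy illustration only: `toySignature`, `matchRate`, and the mulberry32 PRNG are invented for this example and are **not** this library's implementation or API.

```javascript
// Toy sketch of Euclidean LSH (Datar et al., SoCG 2004): each signature
// entry is floor((a . v + b) / w) for a random Gaussian direction a and a
// uniform offset b in [0, w). Close vectors land in the same bucket more
// often than far ones. NOT this library's implementation.

// Small seeded PRNG (mulberry32) so every call draws the same hash functions.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rand) {
  const u = 1 - rand(), v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Hash a vector into `bits` bucket indices. Fixing the seed means every
// vector is hashed with the same random projections, so signatures compare.
function toySignature(vector, bits, seed = 42, w = 4.0) {
  const rand = mulberry32(seed);
  const sig = new Int32Array(bits);
  for (let i = 0; i < bits; i++) {
    let dot = 0;
    for (let d = 0; d < vector.length; d++) dot += gaussian(rand) * vector[d];
    const b = rand() * w; // random offset in [0, w)
    sig[i] = Math.floor((dot + b) / w);
  }
  return sig;
}

// Fraction of matching entries: a crude stand-in for similarity.
function matchRate(sigA, sigB) {
  let same = 0;
  for (let i = 0; i < sigA.length; i++) if (sigA[i] === sigB[i]) same++;
  return same / sigA.length;
}

const a    = [0.8, 0.2, 0.9, 0.1, 0.7];
const near = [0.7, 0.3, 0.8, 0.2, 0.6];  // small Euclidean distance to `a`
const far  = [9.0, -8.0, 7.5, -6.0, 8.5]; // large Euclidean distance to `a`

const sigA = toySignature(a, 64);
console.log(matchRate(sigA, toySignature(near, 64)) >
            matchRate(sigA, toySignature(far, 64))); // expect true
```

The real CS-E2LSH and HCS-E2LSH variants replace the dense Gaussian projections with count-sketch-style projections to make encoding faster, but the bucketing intuition above is the same.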