# sentence2simvecjs

Vector-based sentence similarity (0.0–1.0) + embedding export. JavaScript implementation inspired by [PINTO0309/sentence2simvec](https://github.com/PINTO0309/sentence2simvec).

https://github.com/user-attachments/assets/4738b015-ef68-4503-aa51-a467754d7081

## Features

- **Dice's Coefficient**: Fast surface-level text similarity using n-gram analysis
- **Transformer Embeddings**: Semantic similarity using sentence-transformers/all-MiniLM-L6-v2
- **Embedding Cache**: Pre-compute and cache embeddings for fast similarity search
- **Corpus Management**: Load and search through large text collections efficiently
- **Batch Similarity**: Calculate similarities against an entire corpus at once
- **Benchmarking**: Compare performance and accuracy between methods
- **Electron App**: Built-in GUI for interactive benchmarking
- **Cross-platform**: Works in Node.js and Electron (main & renderer processes)

## Installation

```bash
npm install sentence2simvecjs
```

## Usage

### As a Library

```javascript
const {
  diceCoefficient,
  embeddingSimilarity,
  runBenchmark,
  initializeEmbeddingModel
} = require('sentence2simvecjs');

// Simple Dice's Coefficient
const diceScore = diceCoefficient("Hello world", "Hello there");
console.log(diceScore); // 0.5

// Embedding similarity (async)
async function example() {
  // Initialize the model once (optional; auto-initializes on first use)
  await initializeEmbeddingModel();

  const result = await embeddingSimilarity("Hello world", "Hello there");
  console.log(result.score);         // 0.7234
  console.log(result.executionTime); // 123.45 (ms)
}

// Run a benchmark comparison
async function benchmark() {
  const result = await runBenchmark("Hello world", "Hello there", {
    ngramSize: 3,
    preloadModel: true
  });

  console.log('Dice Score:', result.diceResult.score);
  console.log('Embedding Score:', result.embeddingResult.score);
  console.log('Speed ratio:', result.embeddingResult.executionTime / result.diceResult.executionTime);
}
```

### With Embedding Cache

```javascript
const { EmbeddingCache, CorpusManager } = require('sentence2simvecjs');

// Create an embedding cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: './embeddings'
});

// Add texts to the cache
await cache.addText('Machine learning is awesome');
await cache.addTextsFromFile('corpus.txt');
await cache.addTextsFromJSON('data.json', 'content');

// Find similar texts
const similar = await cache.findSimilar('Deep learning', 5);

// Batch similarity calculation
const scores = await cache.batchSimilarity('Neural networks');
```

### With Corpus Manager

```javascript
const corpus = new CorpusManager({
  enableDiceCache: true,
  enableEmbeddingCache: true
});

// Load a corpus
await corpus.loadFromFile('documents.txt');
await corpus.addItems([
  { text: 'First document', id: 'doc1' },
  { text: 'Second document', id: 'doc2' }
]);

// Search using both methods
const results = await corpus.search('query text', 'both', 10);

// Batch similarity for the entire corpus
const allScores = await corpus.batchSimilarity('query text', 'embedding');
```

### As an Electron App

```bash
# Clone the repository
git clone https://github.com/your-username/sentence2simvecjs
cd sentence2simvecjs

# Install dependencies
npm install

# Build and run
npm start
```
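For intuition, Dice's coefficient compares the sets of character n-grams drawn from each text. The sketch below is a simplified illustration of that formula, not the library's exact implementation; its tokenization and normalization may differ, which is why the library reports 0.5 above while this sketch yields roughly 0.44:

```javascript
// Simplified sketch of Dice's coefficient over character trigrams.
// NOT the library's exact implementation: real case folding,
// whitespace handling, and n-gram counting may differ.
function ngrams(text, n = 3) {
  const grams = new Set();
  for (let i = 0; i <= text.length - n; i++) {
    grams.add(text.slice(i, i + n));
  }
  return grams;
}

function diceSketch(text1, text2, n = 3) {
  const a = ngrams(text1, n);
  const b = ngrams(text2, n);
  let overlap = 0;
  for (const gram of a) {
    if (b.has(gram)) overlap++;
  }
  // Dice = 2 * |A ∩ B| / (|A| + |B|)
  return (2 * overlap) / (a.size + b.size);
}

console.log(diceSketch('Hello world', 'Hello there')); // ≈ 0.44 with this variant
```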
## API

### `diceCoefficient(text1: string, text2: string, ngramSize?: number): number`

Calculate Dice's coefficient between two texts using n-grams.

- `text1`, `text2`: Input texts to compare
- `ngramSize`: Size of n-grams (default: 3)
- Returns: Similarity score between 0.0 and 1.0

### `embeddingSimilarity(text1: string, text2: string): Promise<Result>`

Calculate semantic similarity using transformer embeddings.

- Returns: Object with `score`, `embedding1`, `embedding2`, and `executionTime`

### `runBenchmark(text1: string, text2: string, options?: Options): Promise<ComparisonResult>`

Run both similarity methods and compare their performance.

- `options.ngramSize`: N-gram size for Dice's coefficient
- `options.preloadModel`: Whether to preload the transformer model

### `EmbeddingCache`

Pre-compute and cache embeddings for fast retrieval.

- `addText(text, id?, metadata?)`: Add a single text to the cache
- `addTexts(texts)`: Add multiple texts
- `addTextsFromFile(filePath)`: Load texts from a file
- `findSimilar(query, topK, threshold?)`: Find similar cached texts
- `batchSimilarity(query)`: Get all similarity scores

### `CorpusManager`

Manage large text collections with both Dice and embedding methods.

- `addItem(text, id?, metadata?)`: Add a text to the corpus
- `loadFromFile(filePath, format)`: Load a corpus from a file
- `search(query, method, topK)`: Search the corpus
- `batchSimilarity(query, method)`: Calculate all similarities

## Performance

- **Dice's Coefficient**: ~0.1 ms per comparison
- **Transformer Embeddings**: ~50–200 ms per comparison (after model initialization)
- **Cached Embeddings**: <1 ms per comparison (after initial computation)

Initial model loading takes 1–3 seconds depending on hardware.

## Cache Storage

### Storage Options

The new `EmbeddingCacheV2` supports multiple storage backends:

1. **File System** (Node.js)
2. **LocalStorage** (Browser)
3. **Memory** (Both environments)

```javascript
// File storage (Node.js)
const fileCache = new EmbeddingCacheV2({
  storageType: 'file',
  cacheDir: './embeddings'
});

// LocalStorage (Browser)
const browserCache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_embeddings_',
  maxItems: 1000 // Limit items to prevent quota issues
});

// Memory storage (default)
const memoryCache = new EmbeddingCacheV2({
  storageType: 'memory'
});

// Custom storage adapter
const customCache = new EmbeddingCacheV2({
  storageAdapter: myCustomAdapter // Implements the StorageAdapter interface
});
```

### Browser LocalStorage Example

```html
<script type="module">
  import { EmbeddingCacheV2, initializeEmbeddingModel } from 'sentence2simvecjs';

  async function setupBrowserCache() {
    await initializeEmbeddingModel();

    const cache = new EmbeddingCacheV2({
      storageType: 'localStorage',
      storagePrefix: 'embeddings_',
      maxItems: 500 // Prevent exceeding the localStorage quota
    });

    // Add texts
    await cache.addText('Example text');

    // Find similar texts
    const results = await cache.findSimilar('Query text', 5);

    // Check storage usage
    const info = await cache.getStorageInfo();
    console.log(`Using ${info.estimatedSize / 1024}KB of localStorage`);
  }
</script>
```

### Legacy Cache (File-only)

The original `EmbeddingCache` still works for backward compatibility:

```javascript
// Original file-based cache
const cache = new EmbeddingCache({
  persistToDisk: true,
  cacheDir: '/path/to/my/cache'
});
```

### Cache File Format

The cache is stored as JSON with the following structure:

```jsonc
[
  {
    "id": "unique_id",
    "text": "Original text",
    "embedding": [0.123, -0.456, ...], // 384-dimensional array
    "metadata": { /* optional metadata */ }
  }
]
```
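Because the stored vectors are L2-normalized (see the `embedding` field description in the test-page guide below), cosine similarity between two cached entries reduces to a plain dot product. Here is a minimal sketch of scoring two entries from an exported cache file; the file name `embeddings.json` is a hypothetical example:

```javascript
// Sketch: score two entries from an exported cache file.
// Assumes the JSON structure shown above; 'embeddings.json' is a
// hypothetical file name. Because the vectors are L2-normalized,
// cosine similarity reduces to a dot product.
const fs = require('fs');

function cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // no division by magnitudes needed for unit vectors
}

const entries = JSON.parse(fs.readFileSync('embeddings.json', 'utf8'));
const [first, second] = entries;
console.log(`"${first.text}" vs "${second.text}":`,
  cosine(first.embedding, second.embedding).toFixed(4));
```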
### Cache Management

```javascript
// Clear the entire cache (works with all storage types)
await cache.clear(); // Removes all cached embeddings

// Remove a specific item
await cache.remove('specific_id');

// Export/Import (works with all storage types)
const jsonData = await cache.exportToJSON();
await cache.importFromJSON(jsonData);

// Check storage info
const info = await cache.getStorageInfo();
console.log(`Storage type: ${info.type}`);
console.log(`Items: ${info.itemCount}`);
console.log(`Size: ${info.estimatedSize} bytes`);
```

### Clearing Cache Safely

The `clear()` method removes all cached embeddings:

- **LocalStorage**: Only removes items with the specified prefix
- **File System**: Deletes the cache directory contents
- **Memory**: Clears the in-memory Map

```javascript
// LocalStorage example - only clears items with the 'myapp_' prefix
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_' // Only 'myapp_*' keys will be cleared
});

await cache.clear(); // Other localStorage data remains untouched

// Confirm deletion
const remaining = await cache.size();
console.log(`Items after clear: ${remaining}`); // Should be 0
```

### Storage Limitations

- **LocalStorage**: ~5–10 MB limit in most browsers
- **File System**: Limited by disk space
- **Memory**: Limited by available RAM

Use the `maxItems` option to prevent storage overflow:

```javascript
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  maxItems: 500 // Automatically removes the oldest items
});
```

### Model Storage in Browser

When using @xenova/transformers in the browser, the model files are stored separately from your embedding cache.

#### Where Models Are Stored

- **Location**: Browser's Cache Storage API (not localStorage)
- **Path**: Accessible via DevTools → Application → Cache Storage → `transformers-cache`
- **Size**: ~25 MB for the all-MiniLM-L6-v2 model
- **Persistence**: Survives page reloads; cleared with the browser cache

#### Viewing Cached Models

1. Open DevTools (F12)
2. Go to the Application (Chrome) or Storage (Firefox) tab
3. Expand "Cache Storage"
4. Look for `transformers-cache` or similar

#### Model Cache vs Embedding Cache

- **Model Cache**: Stores the AI model files (Cache Storage API)
- **Embedding Cache**: Stores computed embeddings (localStorage/file/memory)

#### Clearing Model Cache

```javascript
// Clear the transformer model cache
caches.keys().then(names => {
  names.forEach(name => {
    if (name.includes('transformers')) {
      caches.delete(name);
    }
  });
});

// Clear the embedding cache (your computed results)
await cache.clear();
```
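To gauge how much space the model files and other cached data occupy overall, the standard `navigator.storage.estimate()` browser API can be used. This is a generic Web API sketch, not part of sentence2simvecjs:

```javascript
// Check overall origin storage usage (includes the Cache Storage
// entries holding the ~25 MB model files). Standard browser API,
// not part of sentence2simvecjs; availability varies by browser.
if (navigator.storage && navigator.storage.estimate) {
  navigator.storage.estimate().then(({ usage, quota }) => {
    console.log(`Origin is using ${(usage / 1024 / 1024).toFixed(1)} MB`
      + ` of ~${(quota / 1024 / 1024).toFixed(0)} MB`);
  });
}
```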
## Browser Usage

To use the library in a browser environment:

1. Build the browser bundle:

```bash
npm run build:browser
```

2. Serve the files using a local server (to avoid CORS issues):

```bash
npm run serve
# Or use any static file server
```

3. Access the test pages:

- **Dice coefficient only**: `http://localhost:8000/src/browser/test-dice-only.html`
- **Full test with embeddings**: `http://localhost:8000/src/browser/test-localstorage.html`

### Test Page Usage Guide

**Note**: The embedding model initialization may take 10–30 seconds on first load as it downloads the model files (~25 MB) from Hugging Face. The Dice-only test page works immediately, without any model download.

The test page provides an interactive interface for testing the LocalStorage cache functionality:

#### 1. **Add Text to Cache**

- **Text input**: Enter any sentence or paragraph you want to cache
  - Example: "Machine learning is a subset of artificial intelligence"
- **Optional ID**: Provide a custom ID, or leave blank for an auto-generated ID
  - Example: "ml_definition"

<img width="821" height="589" alt="20250722155717" src="https://github.com/user-attachments/assets/8b548691-8703-468a-8c78-24d008fa15e0" />

#### 2. **Bulk Add**

Add multiple texts at once (one per line):

```
Natural language processing enables computers to understand text
Deep learning models can learn complex patterns
Neural networks are inspired by the human brain
JavaScript is a programming language for web development
React is a library for building user interfaces
```

#### 3. **Find Similar**

Enter a query to find similar cached texts:

- Example: "AI and machine learning"
- Example: "Web development frameworks"

Shows the top 5 most similar texts with similarity scores (0.0–1.0).

<img width="814" height="1033" alt="20250722155739" src="https://github.com/user-attachments/assets/3b857213-10c7-4751-9018-364ec1e6cdd3" />

#### 4. **Cache Data**

- **Export Format**: JSON file containing all cached embeddings
- **File Structure**:

```jsonc
[
  {
    "id": "text_1077264583",    // Unique identifier (auto-generated or custom)
    "text": "こんにちは",        // Original text
    "embedding": [              // 384-dimensional vector from all-MiniLM-L6-v2
      -0.10119643807411194,
      // ... (382 more values)
      -0.008699539117515087
    ],
    "timestamp": 1753166234369  // Unix timestamp when cached
  },
  {
    "id": "text_1712359701",
    "text": "はじめまして",
    "embedding": [
      -0.031796280294656754,
      // ... (382 more values)
      -0.005393804516643286
    ],
    "timestamp": 1753166261449
  },
  {
    "id": "text_6942345",
    "text": "今日はいい天気ですね。",
    "embedding": [
      0.03111492656171322,
      // ... (382 more values)
      -0.012813657522201538
    ],
    "timestamp": 1753166295569
  },
  {
    "id": "text_2137068100",
    "text": "Hello.",
    "embedding": [
      -0.09045851230621338,
      // ... (382 more values)
      0.015684669837355614
    ],
    "timestamp": 1753167371990
  },
  {
    "id": "text_1654144361",
    "text": "Hello. Good morning.",
    "embedding": [
      -0.025240488350391388,
      // ... (382 more values)
      0.00397441117092967
    ],
    "timestamp": 1753167383761
  }
]
```

- **Field Descriptions**:
  - `id`: Unique identifier for each cached text
    - Auto-generated format: `text_[hash]` (e.g., "text_1077264583")
    - Custom format: user-provided ID (e.g., "ml_definition")
  - `text`: The original text string that was embedded
  - `embedding`: 384-dimensional Float32Array from the all-MiniLM-L6-v2 model
    - Normalized vector (L2 norm = 1.0)
    - Used for cosine similarity calculations
  - `timestamp`: Unix timestamp (milliseconds since epoch)
    - Used for cache management: the oldest items are removed when `maxItems` is reached (see the sketch after this list)
- **File Size**: Approximately 3–4 KB per cached text (including JSON overhead)
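As noted for the `timestamp` field, eviction is oldest-first once `maxItems` is exceeded. The sketch below illustrates that policy; it is not the library's internal code:

```javascript
// Illustrative oldest-first eviction, as described for the timestamp
// field above. NOT the library's internal implementation.
function evictOldest(entries, maxItems) {
  if (entries.length <= maxItems) return entries;
  return [...entries]
    .sort((a, b) => b.timestamp - a.timestamp) // newest first
    .slice(0, maxItems);                       // keep the newest maxItems
}

const entries = [
  { id: 'text_a', timestamp: 1753166234369 },
  { id: 'text_b', timestamp: 1753166261449 },
  { id: 'text_c', timestamp: 1753166295569 }
];
console.log(evictOldest(entries, 2).map(e => e.id)); // ['text_c', 'text_b']
```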
#### 5. **Storage Management**

- **Storage Info**: Shows current storage usage and item count
- **Show/Hide Cached Texts**: Toggle button to display all cached texts with their IDs
  - Displays a scrollable list (max height: 200px)
  - Shows the ID and full text for each cached item
  - Updates automatically when texts are added or removed
- **Clear Cache**: Removes all cached embeddings (with confirmation)
- **Export/Import**: Save the cache to a JSON file or load it from a file

#### 6. **Performance Metrics**

- **Search Time Display**: Shows the processing time in milliseconds for each search
  - Format: "Similar Texts (Found in XX.XXms):"
  - Measures the complete `findSimilar` execution time
  - Helps illustrate the performance benefit of cached embeddings

#### Example Workflow

1. Add several texts using "Bulk Add" (copy the example above)
2. Click "Show Cached Texts" to view all stored items
3. Search for "artificial intelligence" to find AI-related texts; note the search time (e.g., "Found in 23.45ms")
4. Search for "programming" to find coding-related texts
5. Export your cache to save the embeddings
6. Clear the cache and import the file to restore it

### Including in Your Web Page

```html
<script src="path/to/sentence2simvecjs/dist/browser.js"></script>
<script>
  const { EmbeddingCacheV2, initializeEmbeddingModel } = window.sentence2simvecjs;

  async function init() {
    await initializeEmbeddingModel();
    const cache = new EmbeddingCacheV2({ storageType: 'localStorage' });
    // Use the cache...
  }
</script>
```

**Note**: Direct file:// access will cause CORS errors. Always serve the files over HTTP/HTTPS.

## OffscreenCanvas Visualization

This library includes high-performance visualization components that use OffscreenCanvas and Web Workers for non-blocking rendering.

### Features

- **OffscreenCanvas Rendering**: Moves canvas operations to Web Worker threads
- **Non-blocking UI**: Heavy visualizations don't freeze the main thread
- **Multiple Chart Types**:
  - Heatmap: Similarity matrix visualization
  - Bar Chart: Performance comparison (Dice vs embedding times)
  - Scatter Plot: Score correlation analysis

### Usage

```javascript
import { SimilarityVisualization } from 'sentence2simvecjs/renderer';

// In your React component
<SimilarityVisualization
  data={benchmarkResults}
  type="heatmap" // or "barchart" or "scatter"
  width={600}
  height={400}
  title="Similarity Matrix"
/>
```

### Browser Support

OffscreenCanvas is supported in:

- Chrome 69+
- Firefox 105+
- Edge 79+
- Safari 16.4+ (partial support)

The visualization component automatically falls back to main-thread rendering in unsupported browsers.

### Test Page

To test the OffscreenCanvas visualization:

```bash
npm run serve
# Navigate to http://localhost:8000/src/browser/test-offscreencanvas.html
```

The test page includes:

- A browser compatibility check
- Interactive visualization demos
- Performance benchmarking
- Stress testing with 1000+ data points

### Performance Benefits

Using OffscreenCanvas provides:

- **60 fps UI**: The main thread remains responsive during heavy rendering
- **Parallel Processing**: Multiple visualizations can render simultaneously
- **Better UX**: No freezing when processing large datasets
- **Scalability**: Handles thousands of data points smoothly

## License

Apache-2.0

## Credits

Inspired by [PINTO0309/sentence2simvec](https://github.com/PINTO0309/sentence2simvec)