# sentence2simvecjs
Vector-based sentence similarity (0.0–1.0) + embedding export. JavaScript implementation inspired by [PINTO0309/sentence2simvec](https://github.com/PINTO0309/sentence2simvec).
https://github.com/user-attachments/assets/4738b015-ef68-4503-aa51-a467754d7081
## Features
- **Dice's Coefficient**: Fast surface-level text similarity using n-gram analysis
- **Transformer Embeddings**: Semantic similarity using sentence-transformers/all-MiniLM-L6-v2
- **Embedding Cache**: Pre-compute and cache embeddings for fast similarity search
- **Corpus Management**: Load and search through large text collections efficiently
- **Batch Similarity**: Calculate similarities against entire corpus at once
- **Benchmarking**: Compare performance and accuracy between methods
- **Electron App**: Built-in GUI for interactive benchmarking
- **Cross-platform**: Works in Node.js and Electron (main & renderer processes)
## Installation
```bash
npm install sentence2simvecjs
```
## Usage
### As a Library
```javascript
const {
  diceCoefficient,
  embeddingSimilarity,
  runBenchmark,
  initializeEmbeddingModel
} = require('sentence2simvecjs');

// Simple Dice's Coefficient
const diceScore = diceCoefficient("Hello world", "Hello there");
console.log(diceScore); // 0.5

// Embedding similarity (async)
async function example() {
  // Initialize model once (optional, will auto-init on first use)
  await initializeEmbeddingModel();

  const result = await embeddingSimilarity("Hello world", "Hello there");
  console.log(result.score);         // 0.7234
  console.log(result.executionTime); // 123.45 ms
}

// Run benchmark comparison
async function benchmark() {
  const result = await runBenchmark("Hello world", "Hello there", {
    ngramSize: 3,
    preloadModel: true
  });

  console.log('Dice Score:', result.diceResult.score);
  console.log('Embedding Score:', result.embeddingResult.score);
  console.log('Speed ratio:', result.embeddingResult.executionTime / result.diceResult.executionTime);
}
```
### With Embedding Cache
```javascript
const { EmbeddingCache } = require('sentence2simvecjs');

async function run() {
  // Create embedding cache
  const cache = new EmbeddingCache({
    persistToDisk: true,
    cacheDir: './embeddings'
  });

  // Add texts to cache
  await cache.addText('Machine learning is awesome');
  await cache.addTextsFromFile('corpus.txt');
  await cache.addTextsFromJSON('data.json', 'content');

  // Find similar texts
  const similar = await cache.findSimilar('Deep learning', 5);

  // Batch similarity calculation
  const scores = await cache.batchSimilarity('Neural networks');
}
```
### With Corpus Manager
```javascript
const { CorpusManager } = require('sentence2simvecjs');

async function run() {
  const corpus = new CorpusManager({
    enableDiceCache: true,
    enableEmbeddingCache: true
  });

  // Load corpus
  await corpus.loadFromFile('documents.txt');
  await corpus.addItems([
    { text: 'First document', id: 'doc1' },
    { text: 'Second document', id: 'doc2' }
  ]);

  // Search using both methods
  const results = await corpus.search('query text', 'both', 10);

  // Batch similarity for the entire corpus
  const allScores = await corpus.batchSimilarity('query text', 'embedding');
}
```
### As an Electron App
```bash
# Clone the repository
git clone https://github.com/your-username/sentence2simvecjs
cd sentence2simvecjs
# Install dependencies
npm install
# Build and run
npm start
```
## API
### `diceCoefficient(text1: string, text2: string, ngramSize?: number): number`
Calculate Dice's coefficient between two texts using n-grams.
- `text1`, `text2`: Input texts to compare
- `ngramSize`: Size of n-grams (default: 3)
- Returns: Similarity score between 0.0 and 1.0
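For intuition, Dice's coefficient over character n-grams is `2·|A∩B| / (|A| + |B|)`, where A and B are the n-gram multisets of the two texts. Below is a minimal standalone sketch of that formula; it is illustrative only, and the library's own tokenization and normalization may differ:

```javascript
// Minimal sketch of Dice's coefficient over character n-grams.
// Illustrative only - the library's tokenization/normalization may differ.
function ngramCounts(text, n = 3) {
  const counts = new Map();
  for (let i = 0; i <= text.length - n; i++) {
    const gram = text.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

function diceSketch(text1, text2, n = 3) {
  const a = ngramCounts(text1, n);
  const b = ngramCounts(text2, n);
  let overlap = 0;
  for (const [gram, count] of a) {
    overlap += Math.min(count, b.get(gram) || 0);
  }
  let total = 0;
  for (const count of a.values()) total += count;
  for (const count of b.values()) total += count;
  return total === 0 ? 0 : (2 * overlap) / total;
}
```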
### `embeddingSimilarity(text1: string, text2: string): Promise<Result>`
Calculate semantic similarity using transformer embeddings.
- Returns: Object with `score`, `embedding1`, `embedding2`, and `executionTime`
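The score is the cosine similarity of the two embedding vectors; because all-MiniLM-L6-v2 embeddings are L2-normalized (see the cache file format below), this reduces to a dot product. A sketch of the computation from the returned `embedding1`/`embedding2` fields:

```javascript
// Cosine similarity of two equal-length vectors.
// For L2-normalized embeddings this equals a plain dot product.
function cosineSimilarity(v1, v2) {
  let dot = 0, norm1 = 0, norm2 = 0;
  for (let i = 0; i < v1.length; i++) {
    dot += v1[i] * v2[i];
    norm1 += v1[i] * v1[i];
    norm2 += v2[i] * v2[i];
  }
  return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

// const { embedding1, embedding2 } = await embeddingSimilarity(a, b);
// cosineSimilarity(embedding1, embedding2); // expected to match result.score
```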
### `runBenchmark(text1: string, text2: string, options?: Options): Promise<ComparisonResult>`
Run both similarity methods and compare performance.
- `options.ngramSize`: N-gram size for Dice's coefficient
- `options.preloadModel`: Whether to preload the transformer model
### `EmbeddingCache`
Pre-compute and cache embeddings for fast retrieval.
- `addText(text, id?, metadata?)`: Add single text to cache
- `addTexts(texts)`: Add multiple texts
- `addTextsFromFile(filePath)`: Load texts from file
- `findSimilar(query, topK, threshold?)`: Find similar cached texts
- `batchSimilarity(query)`: Get all similarity scores
### `CorpusManager`
Manage large text collections with both Dice and embedding methods.
- `addItem(text, id?, metadata?)`: Add text to corpus
- `loadFromFile(filePath, format)`: Load corpus from file
- `search(query, method, topK)`: Search corpus
- `batchSimilarity(query, method)`: Calculate all similarities
## Performance
- **Dice's Coefficient**: ~0.1ms per comparison
- **Transformer Embeddings**: ~50-200ms per comparison (after model initialization)
- **Cached Embeddings**: <1ms per comparison (after initial computation)
Initial model loading takes 1-3 seconds depending on hardware.
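To avoid paying that load during the first user-facing comparison, the model can be warmed up at startup with the documented initializer; a small sketch:

```javascript
// Warm up the transformer model at startup so the first real
// comparison doesn't pay the initial 1-3 s load.
const { initializeEmbeddingModel, embeddingSimilarity } = require('sentence2simvecjs');

async function main() {
  await initializeEmbeddingModel(); // one-time model load
  const { score, executionTime } = await embeddingSimilarity('first query', 'second query');
  console.log(score, `${executionTime}ms`);
}

main();
```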
## Cache Storage
### Storage Options
The new `EmbeddingCacheV2` supports multiple storage backends:
1. **File System** (Node.js)
2. **LocalStorage** (Browser)
3. **Memory** (Both environments)
```javascript
const { EmbeddingCacheV2 } = require('sentence2simvecjs');

// File storage (Node.js)
const fileCache = new EmbeddingCacheV2({
  storageType: 'file',
  cacheDir: './embeddings'
});

// LocalStorage (Browser)
const browserCache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_embeddings_',
  maxItems: 1000 // Limit items to prevent quota issues
});

// Memory storage (default)
const memoryCache = new EmbeddingCacheV2({
  storageType: 'memory'
});

// Custom storage adapter
const customCache = new EmbeddingCacheV2({
  storageAdapter: myCustomAdapter // Implement the StorageAdapter interface
});
```
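The `StorageAdapter` interface is not spelled out in this README, so the sketch below is a guess at its likely shape: an object exposing async key/value-style operations. The method names (`getItem`, `setItem`, `removeItem`, `keys`, `clear`) are assumptions for illustration, not the library's confirmed contract; check the package's type definitions before implementing one.

```javascript
// Hypothetical custom adapter backed by a Map. The method names below are
// ASSUMPTIONS about the StorageAdapter interface, not the confirmed API.
class MapStorageAdapter {
  constructor() {
    this.store = new Map();
  }
  async getItem(key) {
    return this.store.has(key) ? this.store.get(key) : null;
  }
  async setItem(key, value) {
    this.store.set(key, value);
  }
  async removeItem(key) {
    this.store.delete(key);
  }
  async keys() {
    return [...this.store.keys()];
  }
  async clear() {
    this.store.clear();
  }
}

const adapterCache = new EmbeddingCacheV2({
  storageAdapter: new MapStorageAdapter()
});
```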
### Browser LocalStorage Example
```html
<script type="module">
  // Assumes a bundler or import map resolves the bare package specifier.
  import { EmbeddingCacheV2, initializeEmbeddingModel } from 'sentence2simvecjs';

  async function setupBrowserCache() {
    await initializeEmbeddingModel();

    const cache = new EmbeddingCacheV2({
      storageType: 'localStorage',
      storagePrefix: 'embeddings_',
      maxItems: 500 // Prevent exceeding the localStorage quota
    });

    // Add texts
    await cache.addText('Example text');

    // Find similar
    const results = await cache.findSimilar('Query text', 5);

    // Check storage usage
    const info = await cache.getStorageInfo();
    console.log(`Using ${info.estimatedSize / 1024}KB of localStorage`);
  }
</script>
```
### Legacy Cache (File-only)
The original `EmbeddingCache` still works for backward compatibility:
```javascript
// Original file-based cache
const cache = new EmbeddingCache({
persistToDisk: true,
cacheDir: '/path/to/my/cache'
});
```
### Cache File Format
The cache is stored as JSON with the following structure:
```jsonc
[
  {
    "id": "unique_id",
    "text": "Original text",
    "embedding": [0.123, -0.456, ...], // 384-dimensional array
    "metadata": { /* optional metadata */ }
  }
]
```
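Because the cache is plain JSON, an export can be post-processed outside the library. For example, here is a sketch that loads an exported file (the file name is hypothetical) and ranks every entry against the first one, using a dot product since the embeddings are L2-normalized:

```javascript
// Sketch: rank exported cache entries against the first entry.
// 'embeddings-export.json' is a hypothetical file name.
const fs = require('fs');

const entries = JSON.parse(fs.readFileSync('embeddings-export.json', 'utf8'));

// Dot product; equivalent to cosine similarity for normalized embeddings.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

const query = entries[0];
const ranked = entries
  .slice(1)
  .map(e => ({ id: e.id, text: e.text, score: dot(query.embedding, e.embedding) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.slice(0, 5));
```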
### Cache Management
```javascript
// Clear all cache (works with all storage types)
await cache.clear(); // Removes all cached embeddings
// Remove specific item
await cache.remove('specific_id');
// Export/Import (works with all storage types)
const jsonData = await cache.exportToJSON();
await cache.importFromJSON(jsonData);
// Check storage info
const info = await cache.getStorageInfo();
console.log(`Storage type: ${info.type}`);
console.log(`Items: ${info.itemCount}`);
console.log(`Size: ${info.estimatedSize} bytes`);
```
### Clearing Cache Safely
The `clear()` method removes all cached embeddings:
- **LocalStorage**: Only removes items with the specified prefix
- **File System**: Deletes the cache directory contents
- **Memory**: Clears the in-memory Map
```javascript
// LocalStorage example - only clears items with the 'myapp_' prefix
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  storagePrefix: 'myapp_' // Only 'myapp_*' keys will be cleared
});

await cache.clear(); // Other localStorage data remains untouched

// Confirm deletion
const remaining = await cache.size();
console.log(`Items after clear: ${remaining}`); // Should be 0
```
### Storage Limitations
- **LocalStorage**: ~5-10MB limit in most browsers
- **File System**: Limited by disk space
- **Memory**: Limited by available RAM
Use `maxItems` option to prevent storage overflow:
```javascript
const cache = new EmbeddingCacheV2({
  storageType: 'localStorage',
  maxItems: 500 // Automatically removes oldest items
});
```
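A quick way to confirm the eviction behavior in the browser, using the documented `addText()` and `size()` methods (a sketch, not a test from the project):

```javascript
// Sketch: verify that maxItems caps the cache size (browser, async context).
async function demoEviction() {
  const cache = new EmbeddingCacheV2({
    storageType: 'localStorage',
    maxItems: 3
  });

  for (let i = 0; i < 5; i++) {
    await cache.addText(`Sample sentence number ${i}`);
  }

  console.log(await cache.size()); // Expected: 3 (the two oldest items evicted)
}
```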
### Model Storage in Browser
When using @xenova/transformers in the browser, the model files are stored separately from your embedding cache:
#### Where Models are Stored
- **Location**: Browser's Cache Storage API (not localStorage)
- **Path**: Accessible via DevTools → Application → Cache Storage → `transformers-cache`
- **Size**: ~25MB for the all-MiniLM-L6-v2 model
- **Persistence**: Survives page reloads, cleared with browser cache
#### Viewing Cached Models
1. Open DevTools (F12)
2. Go to Application (Chrome) or Storage (Firefox) tab
3. Expand "Cache Storage"
4. Look for `transformers-cache` or similar
#### Model Cache vs Embedding Cache
- **Model Cache**: Stores the AI model files (Cache Storage API)
- **Embedding Cache**: Stores computed embeddings (localStorage/file/memory)
#### Clearing Model Cache
```javascript
// Clear the transformer model cache
caches.keys().then(names => {
  names.forEach(name => {
    if (name.includes('transformers')) {
      caches.delete(name);
    }
  });
});

// Clear the embedding cache (your computed results)
await cache.clear();
```
## Browser Usage
To use in a browser environment:
1. Build the browser bundle:
```bash
npm run build:browser
```
2. Serve the files using a local server (to avoid CORS issues):
```bash
npm run serve
# Or use any static file server
```
3. Access the test pages:
- **Dice coefficient only**: `http://localhost:8000/src/browser/test-dice-only.html`
- **Full test with embeddings**: `http://localhost:8000/src/browser/test-localstorage.html`
### Test Page Usage Guide
**Note**: The embedding model initialization may take 10-30 seconds on first load as it downloads the model files (~25MB) from Hugging Face. The Dice-only test page works immediately without any model download.
The test page provides an interactive interface to test the LocalStorage cache functionality:
#### 1. **Add Text to Cache**
- **Text input**: Enter any sentence or paragraph you want to cache
  - Example: "Machine learning is a subset of artificial intelligence"
- **Optional ID**: Provide a custom ID, or leave blank for an auto-generated ID
  - Example: "ml_definition"
<img width="821" height="589" alt="20250722155717" src="https://github.com/user-attachments/assets/8b548691-8703-468a-8c78-24d008fa15e0" />
#### 2. **Bulk Add**
- Add multiple texts at once (one per line):
```
Natural language processing enables computers to understand text
Deep learning models can learn complex patterns
Neural networks are inspired by the human brain
JavaScript is a programming language for web development
React is a library for building user interfaces
```
#### 3. **Find Similar**
- Enter a query to find similar cached texts:
  - Example: "AI and machine learning"
  - Example: "Web development frameworks"
- Shows the top 5 most similar texts with similarity scores (0.0-1.0)
<img width="814" height="1033" alt="20250722155739" src="https://github.com/user-attachments/assets/3b857213-10c7-4751-9018-364ec1e6cdd3" />
#### 4. **Cache Data**
- **Export Format**: JSON file containing all cached embeddings
- **File Structure**:
```jsonc
[
  {
    "id": "text_1077264583",   // Unique identifier (auto-generated or custom)
    "text": "こんにちは",       // Original text
    "embedding": [             // 384-dimensional vector from all-MiniLM-L6-v2
      -0.10119643807411194,
      // ... (382 more values)
      -0.008699539117515087
    ],
    "timestamp": 1753166234369 // Unix timestamp when cached
  },
  {
    "id": "text_1712359701",
    "text": "はじめまして",
    "embedding": [
      -0.031796280294656754,
      // ... (382 more values)
      -0.005393804516643286
    ],
    "timestamp": 1753166261449
  },
  {
    "id": "text_6942345",
    "text": "今日はいい天気ですね。",
    "embedding": [
      0.03111492656171322,
      // ... (382 more values)
      -0.012813657522201538
    ],
    "timestamp": 1753166295569
  },
  {
    "id": "text_2137068100",
    "text": "Hello.",
    "embedding": [
      -0.09045851230621338,
      // ... (382 more values)
      0.015684669837355614
    ],
    "timestamp": 1753167371990
  },
  {
    "id": "text_1654144361",
    "text": "Hello. Good morning.",
    "embedding": [
      -0.025240488350391388,
      // ... (382 more values)
      0.00397441117092967
    ],
    "timestamp": 1753167383761
  }
]
```
- **Field Descriptions**:
  - `id`: Unique identifier for each cached text
    - Auto-generated format: `text_[hash]` (e.g., "text_1077264583")
    - Custom format: user-provided ID (e.g., "ml_definition")
  - `text`: The original text string that was embedded
  - `embedding`: 384-dimensional Float32Array from the all-MiniLM-L6-v2 model
    - Normalized vector (L2 norm = 1.0)
    - Used for cosine similarity calculations
  - `timestamp`: Unix timestamp (milliseconds since epoch)
    - Used for cache management (oldest items removed when maxItems is reached)
- **File Size**: Approximately 3-4KB per cached text (including JSON overhead)
#### 5. **Storage Management**
- **Storage Info**: Shows current storage usage and item count
- **Show/Hide Cached Texts**: Toggle button to display all cached texts with their IDs
  - Displays in a scrollable list (max height: 200px)
  - Shows the ID and full text for each cached item
  - Automatically updates when texts are added or removed
- **Clear Cache**: Removes all cached embeddings (with confirmation)
- **Export/Import**: Save the cache to a JSON file or load it from a file
#### 6. **Performance Metrics**
- **Search Time Display**: Shows the processing time in milliseconds for each search
  - Format: "Similar Texts (Found in XX.XXms):"
  - Measures the complete `findSimilar` execution time
  - Helps understand the performance benefit of cached embeddings
#### Example Workflow:
1. Add several texts using "Bulk Add" (copy the example above)
2. Click "Show Cached Texts" to view all stored items
3. Search for "artificial intelligence" to find AI-related texts
   - Note the search time (e.g., "Found in 23.45ms")
4. Search for "programming" to find coding-related texts
5. Export your cache to save the embeddings
6. Clear the cache, then import the file to restore it
### Including in Your Web Page
```html
<script src="path/to/sentence2simvecjs/dist/browser.js"></script>
<script>
  const { EmbeddingCacheV2, initializeEmbeddingModel } = window.sentence2simvecjs;

  async function init() {
    await initializeEmbeddingModel();

    const cache = new EmbeddingCacheV2({
      storageType: 'localStorage'
    });

    // Use the cache...
  }
</script>
```
**Note**: Direct file:// access will cause CORS errors. Always serve through HTTP/HTTPS.
## OffscreenCanvas Visualization
This library includes high-performance visualization components using OffscreenCanvas and Web Workers for non-blocking rendering.
### Features
- **OffscreenCanvas Rendering**: Moves canvas operations to Web Worker threads
- **Non-blocking UI**: Heavy visualizations don't freeze the main thread
- **Multiple Chart Types**:
  - Heatmap: similarity matrix visualization
  - Bar chart: performance comparison (Dice vs. embedding times)
  - Scatter plot: score correlation analysis
### Usage
```javascript
import { SimilarityVisualization } from 'sentence2simvecjs/renderer';

// In your React component
<SimilarityVisualization
  data={benchmarkResults}
  type="heatmap" // or "barchart" or "scatter"
  width={600}
  height={400}
  title="Similarity Matrix"
/>
```
### Browser Support
OffscreenCanvas is supported in:
- Chrome 69+
- Firefox 105+
- Edge 79+
- Safari 16.4+ (partial support)
The visualization component automatically falls back to main thread rendering for unsupported browsers.
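To see which path a given browser will take, a standard feature check looks like the following (a sketch; the component's internal detection may differ):

```javascript
// Minimal OffscreenCanvas feature-detection sketch.
// The component's own fallback logic may differ.
function supportsOffscreenCanvas() {
  return typeof OffscreenCanvas !== 'undefined' &&
    typeof HTMLCanvasElement !== 'undefined' &&
    'transferControlToOffscreen' in HTMLCanvasElement.prototype;
}

console.log(supportsOffscreenCanvas()
  ? 'Rendering can move to a Web Worker'
  : 'Falling back to main-thread rendering');
```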
### Test Page
To test OffscreenCanvas visualization:
```bash
npm run serve
# Navigate to http://localhost:8000/src/browser/test-offscreencanvas.html
```
The test page includes:
- Browser compatibility check
- Interactive visualization demos
- Performance benchmarking
- Stress testing with 1000+ data points
### Performance Benefits
Using OffscreenCanvas provides:
- **60fps UI**: Main thread remains responsive during heavy rendering
- **Parallel Processing**: Multiple visualizations can render simultaneously
- **Better UX**: No freezing when processing large datasets
- **Scalability**: Handle thousands of data points smoothly
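For context, the core pattern behind these benefits is handing a canvas's rendering rights to a worker. A generic sketch of that handoff (standard Web APIs, not this library's worker code; `render-worker.js` is a hypothetical file name):

```javascript
// main.js - transfer rendering control of a canvas to a worker
const canvas = document.querySelector('canvas');
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('render-worker.js');
// The OffscreenCanvas must appear in the transfer list; the element
// can no longer be drawn on from the main thread afterwards.
worker.postMessage({ canvas: offscreen }, [offscreen]);

// render-worker.js - draw without blocking the main thread
self.onmessage = ({ data }) => {
  const ctx = data.canvas.getContext('2d');
  ctx.fillStyle = '#4a90d9';
  ctx.fillRect(0, 0, data.canvas.width, data.canvas.height);
};
```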
## License
Apache-2.0
## Credits
Inspired by [PINTO0309/sentence2simvec](https://github.com/PINTO0309/sentence2simvec)