# NeXt-TDNN Speaker Verification for Web

Real-time speaker verification in the browser using NeXt-TDNN models. Compare two audio samples to determine if they're from the same speaker.

## 🎯 Live Demo

Try it now: [https://jaehyun-ko.github.io/node-speaker-verification/](https://jaehyun-ko.github.io/node-speaker-verification/)

Simple and intuitive speaker verification:

- 🎤 Record audio directly from microphone
- 📁 Upload audio files
- 🔍 Get similarity score instantly

## 🚀 Quick Start (Simple API)

### API Methods

- `initialize(model, options?)` - Initialize with a model
- `compareAudio(audio1, audio2)` - Compare two audio samples
- `getEmbedding(audio)` - Extract speaker embedding from audio
- `compareEmbeddings(embedding1, embedding2)` - Compare pre-computed embeddings
- `cleanup()` - Release resources
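The typical lifecycle is: initialize once, run as many comparisons as you need, then call `cleanup()` to release resources. A minimal sketch using only the methods listed above (`fileA` and `fileB` are placeholder audio inputs, e.g. `File` objects from a file input):

```javascript
const verifier = new SpeakerVerification();
await verifier.initialize('standard-256'); // downloads and loads the model once

// Run as many comparisons as needed on the same instance
const { similarity } = await verifier.compareAudio(fileA, fileB);
console.log(similarity > 0.5 ? 'Likely same speaker' : 'Likely different speakers');

// Release resources when you are done with the verifier
await verifier.cleanup();
```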
### CDN Usage (Simplest - Just 3 Lines!)

```html
<!DOCTYPE html>
<html>
<head>
  <!-- IMPORTANT: Load ONNX Runtime first -->
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.16.3/dist/ort.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/@jaehyun-ko/speaker-verification@5.0.0/dist/speaker-verification.js"></script>
</head>
<body>
  <input type="file" id="audio1" accept="audio/*">
  <input type="file" id="audio2" accept="audio/*">
  <button onclick="compareSpeakers()">Compare</button>

  <script>
    // Create verifier instance
    const verifier = new SpeakerVerification();

    async function compareSpeakers() {
      // 1. Initialize (only needed once)
      await verifier.initialize('standard-256');

      // 2. Get audio files
      const file1 = document.getElementById('audio1').files[0];
      const file2 = document.getElementById('audio2').files[0];

      // 3. Compare! That's it!
      const result = await verifier.compareAudio(file1, file2);
      console.log('Similarity:', (result.similarity * 100).toFixed(1) + '%');
      console.log('Same speaker?', result.similarity > 0.5); // You decide the threshold!
    }
  </script>
</body>
</html>
```

### NPM Installation

```bash
# Install both ONNX Runtime and the speaker verification library
npm install onnxruntime-web @jaehyun-ko/speaker-verification
```

```javascript
import * as ort from 'onnxruntime-web';
import { SpeakerVerification } from '@jaehyun-ko/speaker-verification';

// Optional: Configure ONNX Runtime WASM paths if needed
// ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.16.3/dist/';

// Create instance
const verifier = new SpeakerVerification();

// Initialize with model (auto-downloads from Hugging Face)
await verifier.initialize('standard-256'); // or 'mobile-128' for smaller/faster

// Compare any audio format (File, Blob, ArrayBuffer, Float32Array)
const result = await verifier.compareAudio(audio1, audio2);

console.log(result);
// {
//   similarity: 0.92,    // 0.0 to 1.0 (higher = more similar)
//   processingTime: 523  // milliseconds
// }

// You decide what threshold to use
const isSameSpeaker = result.similarity > 0.5; // Common threshold: 0.5
```

### Available Models

```javascript
// Standard models (best accuracy)
'standard-256' // 28MB - Recommended
'standard-128' // 7.5MB - Faster
'standard-192' // 16MB
'standard-384' // 32MB - Highest accuracy

// Mobile models (optimized for size/speed)
'mobile-128'   // 5MB - Smallest
'mobile-256'   // 20MB - Best mobile balance
```

## 📱 Microphone Recording

```javascript
// With the simple API, just pass the recorded blob
const verifier = new SpeakerVerification();
await verifier.initialize('standard-256');

// Record audio using browser API
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
const chunks = [];

mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
mediaRecorder.onstop = async () => {
  const audioBlob = new Blob(chunks, { type: 'audio/webm' });

  // Compare with another audio
  const result = await verifier.compareAudio(audioBlob, anotherAudio);
  console.log('Similarity:', result.similarity);
};

mediaRecorder.start();
setTimeout(() => mediaRecorder.stop(), 3000); // Record for 3 seconds
```
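If you prefer promises over the event-callback style above, the recording step can be wrapped in a small helper. The `recordBlob` helper below is an illustration built only on standard browser APIs and is not part of the library; the library just needs any Blob, File, ArrayBuffer, or Float32Array:

```javascript
// Record from the microphone for a fixed duration and resolve with an audio Blob.
async function recordBlob(durationMs = 3000) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  const stopped = new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: 'audio/webm' }));
  });

  recorder.start();
  setTimeout(() => {
    recorder.stop();
    stream.getTracks().forEach((track) => track.stop()); // release the microphone
  }, durationMs);

  return stopped;
}

// Usage: record two samples and compare them (reusing the initialized verifier from above)
const sampleA = await recordBlob();
const sampleB = await recordBlob();
const { similarity } = await verifier.compareAudio(sampleA, sampleB);
console.log('Similarity:', similarity);
```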
## 🎛️ Available Models

All models are hosted on [Hugging Face](https://huggingface.co/jaehyun-ko/next-tdnn-onnx).

### Simple API Model Keys

| Key | Size | Channels | Description |
|-----|------|----------|-------------|
| `standard-256` | 28MB | 256 | **Recommended** - Best balance |
| `standard-128` | 7.5MB | 128 | Compact, faster processing |
| `standard-192` | 16MB | 192 | Medium size and accuracy |
| `standard-384` | 32MB | 384 | Highest accuracy |
| `mobile-128` | 5MB | 128 | Smallest, mobile-optimized |
| `mobile-256` | 20MB | 256 | Best mobile balance |

### Full Model Names (for advanced usage)

| Model | Size | Description |
|-------|------|-------------|
| `NeXt_TDNN_C256_B3_K65_7_cosine` | 28MB | Standard 256-channel |
| `NeXt_TDNN_C128_B3_K65_7_cosine` | 7.5MB | Compact 128-channel |
| `NeXt_TDNN_C192_B1_K65_7_cosine` | 16MB | Medium 192-channel |
| `NeXt_TDNN_C384_B1_K65_7_cosine` | 32MB | Large 384-channel |
| `NeXt_TDNN_light_C128_B3_K65_7_cosine` | 5MB | Mobile 128-channel |
| `NeXt_TDNN_light_C256_B3_K65_7_cosine` | 20MB | Mobile 256-channel |

## 📊 Understanding Results

- **Similarity Score**: 0.0 to 1.0 (higher = more similar)
- **Recommended Threshold**: 0.5
- **Adjust the threshold** to suit your application:
  - Higher threshold (0.7 or above) = stricter matching, fewer false positives
  - Lower threshold (0.3 or below) = more permissive matching, fewer false negatives

## 🛠️ Advanced Usage

### Custom Model Loading with Simple API

```javascript
// Load custom model from ArrayBuffer
const modelData = await fetch('path/to/custom-model.onnx').then(r => r.arrayBuffer());

const verifier = new SpeakerVerification();
await verifier.initialize('standard-256', { modelData });

// Or disable caching for development
await verifier.initialize('standard-256', { cacheModel: false });
```

### Batch Processing

```javascript
const verifier = new SpeakerVerification();
await verifier.initialize('standard-256');

// Compare consecutive audio pairs
const results = [];
for (let i = 0; i < audioFiles.length - 1; i++) {
  const result = await verifier.compareAudio(audioFiles[i], audioFiles[i + 1]);
  results.push(result);
}

// Get average similarity
const avgSimilarity = results.reduce((sum, r) => sum + r.similarity, 0) / results.length;
```

### Working with Embeddings

You can extract and compare speaker embeddings directly:

```javascript
const verifier = new SpeakerVerification();
await verifier.initialize('standard-256');

// Extract embeddings from audio
const embedding1 = await verifier.getEmbedding(audio1);
const embedding2 = await verifier.getEmbedding(audio2);

console.log('Embedding 1:', embedding1);
// {
//   embedding: Float32Array(192), // Normalized speaker vector
//   processingTime: 245           // milliseconds
// }

// Compare pre-computed embeddings
const similarity = verifier.compareEmbeddings(embedding1.embedding, embedding2.embedding);
console.log('Similarity:', similarity); // 0.0 to 1.0

// Store embeddings for later use
const embeddingData = Array.from(embedding1.embedding); // Convert to regular array for storage
localStorage.setItem('speaker1', JSON.stringify(embeddingData));

// Load and use stored embeddings
const storedData = JSON.parse(localStorage.getItem('speaker1'));
const storedEmbedding = new Float32Array(storedData);
const similarity2 = verifier.compareEmbeddings(storedEmbedding, embedding2.embedding);
```

This is useful for:

- Building speaker databases (see the sketch below)
- Caching embeddings for performance
- Analyzing speaker characteristics
- Custom similarity metrics
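For example, a simple in-memory speaker database can be built by enrolling one embedding per known speaker and picking the best match at query time. This is a minimal sketch using only the `getEmbedding` and `compareEmbeddings` calls shown above; the `enroll`/`identify` helpers, the placeholder audio inputs, and the 0.5 acceptance threshold are illustrative choices you should tune for your data:

```javascript
const verifier = new SpeakerVerification();
await verifier.initialize('standard-256');

// Enroll known speakers: one embedding per speaker, keyed by name
const enrolled = new Map();
async function enroll(name, audio) {
  const { embedding } = await verifier.getEmbedding(audio);
  enrolled.set(name, embedding);
}

// Identify: compare a probe utterance against every enrolled embedding
async function identify(audio, threshold = 0.5) {
  const { embedding } = await verifier.getEmbedding(audio);
  let best = { name: null, similarity: -1 };
  for (const [name, enrolledEmbedding] of enrolled) {
    const similarity = verifier.compareEmbeddings(embedding, enrolledEmbedding);
    if (similarity > best.similarity) best = { name, similarity };
  }
  // Reject the match if even the best score falls below the threshold
  return best.similarity >= threshold ? best : { name: null, similarity: best.similarity };
}

// Usage (aliceAudio, bobAudio, unknownAudio are placeholder audio inputs)
await enroll('alice', aliceAudio);
await enroll('bob', bobAudio);
console.log(await identify(unknownAudio)); // e.g. { name: 'alice', similarity: 0.83 }
```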
## 📝 License

Apache License 2.0

## 🤝 Credits

Based on the [NeXt-TDNN](https://github.com/dmlguq456/NeXt_TDNN_ASV) architecture for speaker verification.

## 📚 Citation

If you use this library in your research, please cite:

```bibtex
@INPROCEEDINGS{10447037,
  author={Heo, Hyun-Jun and Shin, Ui-Hyeop and Lee, Ran and Cheon, YoungJu and Park, Hyung-Min},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification},
  year={2024},
  volume={},
  number={},
  pages={11186-11190},
  keywords={Convolution;Speech recognition;Transformers;Acoustics;Task analysis;Speech processing;speaker recognition;speaker verification;TDNN;ConvNeXt;multi-scale},
  doi={10.1109/ICASSP48485.2024.10447037}}
```