# @steelbrain/media-speech-detection-web
Production-ready speech detection using Silero VAD ONNX model for web browsers
Speech detection for web browsers using the Silero VAD ONNX model.

## Installation
```bash
npm install @steelbrain/media-speech-detection-web
```
**Modern Bundler Support**: This package is fully compatible with modern bundlers (Webpack 5, Next.js, Vite, etc.). The ONNX model file is automatically detected and bundled, so no manual setup or public folder configuration is required.
## Quick Start
```typescript
import { speechFilter, preloadModel } from '@steelbrain/media-speech-detection-web';
import { ingestAudioStream, RECOMMENDED_AUDIO_CONSTRAINTS } from '@steelbrain/media-ingest-audio';

// Optional: Preload model during app initialization for faster first use
await preloadModel();

// Get microphone access
const mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: RECOMMENDED_AUDIO_CONSTRAINTS
});

// Create 16kHz audio stream
const audioStream = await ingestAudioStream(mediaStream);

// Option A: filter audio to only speech chunks.
// `speechProcessor` is your own WritableStream that consumes the speech audio.
const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  threshold: 0.5
});

await audioStream
  .pipeThrough(vadTransform)
  .pipeTo(speechProcessor);

// Option B: events-only (no audio output) using the .tee() pattern.
// Use this instead of Option A: a ReadableStream can only be consumed once,
// so `audioStream` cannot be piped above and then teed here.
const [processStream, eventsStream] = audioStream.tee();

// Process audio on one branch
processStream.pipeTo(speechProcessor);

// Handle events on another branch without outputting audio
eventsStream.pipeThrough(speechFilter({
  noEmit: true, // Don't emit audio chunks
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onMisfire: () => console.log('⚠️ Short speech segment filtered')
}));
```
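
The Quick Start pipes into `speechProcessor` without defining it. As a minimal sketch, assuming the filtered stream carries `Float32Array` chunks (an assumption; check the package's type definitions), it can be any `WritableStream`. The helpers `createCollectingSink` and `concatChunks` below are hypothetical and not part of the package; they simply collect chunks in memory.

```typescript
// Hypothetical helpers: NOT part of @steelbrain/media-speech-detection-web.
// Assumes each chunk arriving at the sink is a Float32Array of audio samples.
function createCollectingSink(collected: Float32Array[]): WritableStream<Float32Array> {
  return new WritableStream<Float32Array>({
    write(chunk) {
      // Copy the chunk in case the producer reuses its buffer.
      collected.push(chunk.slice());
    },
  });
}

// Concatenate the collected chunks into one contiguous buffer.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

With these in place, the pipeline could end in `.pipeTo(createCollectingSink(collected))`, and `concatChunks(collected)` would yield the speech audio gathered so far.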
## API

### `preloadModel()`

Preloads the Silero VAD ONNX model by fetching it into the browser cache, which removes the network delay the first time speech detection is used.

**Usage**: `await preloadModel()`. Call it during app initialization.

### `speechFilter(options)`

Creates a `TransformStream` that filters audio and outputs only speech chunks. Set the `noEmit` option for events-only processing.

**Usage**: `audioStream.pipeThrough(speechFilter(options)).pipeTo(processor)`
```typescript
interface VADOptions {
  // Event Handlers
  onSpeechStart?: () => void;
  onSpeechEnd?: (speechAudio: Float32Array) => void;
  onMisfire?: () => void;
  onError?: (error: Error) => void;
  onDebugLog?: (message: string) => void;

  // Detection Configuration
  threshold?: number;            // Speech detection threshold (0-1). Default: 0.5
  minSpeechDurationMs?: number;  // Minimum speech duration in ms. Default: 160ms
  redemptionDurationMs?: number; // Grace period before confirming speech end. Default: 400ms
  lookBackDurationMs?: number;   // Lookback buffer for smooth speech start. Default: 384ms

  // Stream Control
  noEmit?: boolean;              // Don't emit chunks, only trigger callbacks. Default: false
}
```
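
`onSpeechEnd` receives the segment as a `Float32Array`. As a small sketch, assuming that buffer holds the 16kHz mono samples of the finished speech segment (the sample rate comes from the Quick Start pipeline; the exact buffer contents are an assumption), its duration follows from the sample count. `speechDurationMs` is a hypothetical helper, not a package export.

```typescript
// Assumption: the buffer contains 16 kHz mono samples.
const SEGMENT_SAMPLE_RATE_HZ = 16000;

// Hypothetical helper: duration of a speech segment in milliseconds.
function speechDurationMs(
  speechAudio: Float32Array,
  sampleRateHz: number = SEGMENT_SAMPLE_RATE_HZ,
): number {
  // Multiply before dividing to keep the arithmetic exact for common sizes.
  return (speechAudio.length * 1000) / sampleRateHz;
}
```

It could be used as `onSpeechEnd: (speechAudio) => console.log(speechDurationMs(speechAudio), 'ms')`.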
The package provides carefully tuned defaults that work well for most use cases:
| Parameter | Default | Purpose |
|-----------|---------|---------|
| `threshold` | `0.5` | Balanced speech detection |
| `minSpeechDurationMs` | `160ms` | Filters out very short sounds |
| `redemptionDurationMs` | `400ms` | Handles natural speech pauses |
| `lookBackDurationMs` | `384ms` | Captures natural audio context before speech |
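
Because the model works on 512-sample windows at 16kHz (32ms each), these millisecond defaults map onto whole analysis frames. The sketch below shows the arithmetic; rounding up is an assumption, since the package's internal rounding is not documented here, and `msToFrames` is a hypothetical helper.

```typescript
const VAD_SAMPLE_RATE_HZ = 16000;
const VAD_FRAME_SAMPLES = 512;
// 512 samples at 16 kHz is 32 ms per frame.
const VAD_FRAME_MS = (VAD_FRAME_SAMPLES * 1000) / VAD_SAMPLE_RATE_HZ;

// Hypothetical helper: how many 32 ms frames cover a duration.
function msToFrames(durationMs: number): number {
  // Assumption: round up so the configured duration is always fully covered.
  return Math.ceil(durationMs / VAD_FRAME_MS);
}

const defaultFrames = {
  minSpeech: msToFrames(160),  // 160 / 32 = 5 frames
  redemption: msToFrames(400), // 400 / 32 = 12.5, rounded up to 13 frames
  lookBack: msToFrames(384),   // 384 / 32 = 12 frames
};
```

Two of the three defaults are exact multiples of the frame length, which suggests they were chosen with the 32ms window in mind.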
Custom threshold with error and debug handlers:

```typescript
const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onError: (error) => console.error('VAD Error:', error),
  onDebugLog: (message) => console.log('VAD Debug:', message),
  threshold: 0.6
});
```
```typescript
// Preload model during app startup
await preloadModel();

// Complete pipeline: microphone → VAD → transcription.
// `transcriptionTransform`, `displayResults`, and the indicator functions
// are provided by your application.
await audioStream
  .pipeThrough(speechFilter({
    onSpeechStart: () => showRecordingIndicator(),
    onSpeechEnd: () => hideRecordingIndicator(),
    threshold: 0.5
  }))
  .pipeThrough(transcriptionTransform)
  .pipeTo(displayResults);
```
```typescript
// Preload model early in your application lifecycle
window.addEventListener('load', async () => {
  try {
    await preloadModel();
    console.log('VAD model preloaded and cached');
  } catch (error) {
    console.warn('Failed to preload VAD model:', error);
  }
});
```
## How It Works

1. **Silero VAD Model**: Uses the pre-trained Silero VAD ONNX model for production-ready accuracy
2. **Audio Processing**: Processes 16kHz mono audio in 512-sample windows (32ms frames)
3. **State Machine**: Tracks speech, intermediate, and silent states to decide when speech starts and ends
4. **Lookback Buffer**: Keeps recent audio so the beginning of a speech segment is not clipped
5. **Temporal Smoothing**: Uses configurable timing thresholds to prevent false triggers
6. **Web Streams**: Built on the Web Streams API for composability with other stream stages
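
The state machine and temporal smoothing steps above can be sketched as follows. This is an illustration only, not the package's implementation: the meaning given to the `intermediate` state (inside a segment but currently below the threshold), the moment `speechStart` fires, and all names here are assumptions.

```typescript
type VadState = 'silent' | 'speech' | 'intermediate';
type VadEvent = 'speechStart' | 'speechEnd' | 'misfire';

interface VadConfig {
  threshold: number;        // per-frame probability needed to count as speech
  minSpeechFrames: number;  // shorter segments are reported as misfires
  redemptionFrames: number; // quiet frames tolerated before the segment ends
}

// Illustrative sketch; the real library may differ in every detail.
class VadStateMachine {
  state: VadState = 'silent';
  private config: VadConfig;
  private speechFrames = 0; // frames at or above threshold in this segment
  private quietFrames = 0;  // consecutive frames below threshold

  constructor(config: VadConfig) {
    this.config = config;
  }

  // Feed one per-frame speech probability; returns an event if one fired.
  push(probability: number): VadEvent | null {
    const isSpeech = probability >= this.config.threshold;
    if (this.state === 'silent') {
      if (!isSpeech) return null;
      this.state = 'speech';
      this.speechFrames = 1;
      this.quietFrames = 0;
      return 'speechStart';
    }
    if (isSpeech) {
      // Speech resumed (or continued): cancel any pending end.
      this.state = 'speech';
      this.speechFrames += 1;
      this.quietFrames = 0;
      return null;
    }
    // Below threshold inside a segment: wait out the redemption period.
    this.state = 'intermediate';
    this.quietFrames += 1;
    if (this.quietFrames < this.config.redemptionFrames) return null;
    const event: VadEvent =
      this.speechFrames >= this.config.minSpeechFrames ? 'speechEnd' : 'misfire';
    this.state = 'silent';
    this.speechFrames = 0;
    this.quietFrames = 0;
    return event;
  }
}
```

The redemption counter is what lets a short pause between words stay inside one segment, while the minimum-frames check turns brief noises into misfires rather than speech segments.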
## Model Details
- **Model**: [Silero VAD v4.0](https://github.com/snakers4/silero-vad) (MIT License)
- **Input**: 16kHz mono audio, 512 samples per inference (32ms windows)
- **Output**: Speech probability (0-1) per window + internal LSTM state
- **Model Size**: ~2.3MB ONNX format
- **Performance**: <1ms inference time per chunk on modern browsers
- **Accuracy**: Enterprise-grade performance across diverse acoustic conditions
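
Since the model consumes exactly 512 samples per inference, incoming audio chunks of arbitrary size have to be regrouped into fixed windows first. The package presumably handles this internally; the sketch below only illustrates the idea, and `FrameSplitter` is a hypothetical name.

```typescript
const WINDOW_SAMPLES = 512; // 32 ms at 16 kHz

// Illustrative sketch: regroup arbitrary-size chunks into 512-sample windows.
class FrameSplitter {
  private pending = new Float32Array(0); // leftover samples from earlier chunks

  // Append a chunk and return every complete window now available.
  push(chunk: Float32Array): Float32Array[] {
    const merged = new Float32Array(this.pending.length + chunk.length);
    merged.set(this.pending, 0);
    merged.set(chunk, this.pending.length);

    const frames: Float32Array[] = [];
    let offset = 0;
    while (offset + WINDOW_SAMPLES <= merged.length) {
      frames.push(merged.slice(offset, offset + WINDOW_SAMPLES));
      offset += WINDOW_SAMPLES;
    }
    // Keep the remainder for the next call.
    this.pending = merged.slice(offset);
    return frames;
  }
}
```

Each returned window would then be passed to the model together with the recurrent state from the previous inference.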
## Credits
This package uses the [Silero VAD](https://github.com/snakers4/silero-vad) model developed by the Silero Team and licensed under the MIT License. The model provides state-of-the-art speech detection across a range of languages and acoustic conditions.
## License
MIT License - See LICENSE file for details.
**Silero VAD Model**: MIT License (© Silero Team)