# react-native-sherpa-onnx-offline-stt

A React Native library for **offline speech-to-text** using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx). Runs entirely on-device with no internet connection required.

## Features

- **Offline STT** - Speech recognition runs locally on the device
- **Two modes** - Streaming (real-time) and Offline (VAD-triggered batch processing)
- **TEN-VAD** - Voice Activity Detection for accurate speech segmentation
- **Speaker Diarization** - Identify different speakers in a conversation
- **Speech Denoising** - GTCRN-based noise reduction
- **Background Recording** - Continue recording when the app is minimized
- **Performance Metrics** - RTFx, processing time, confidence scores
- **Streaming State** - Two-tier volatile/confirmed transcript updates

## Installation

```bash
npm install react-native-sherpa-onnx-offline-stt
# or
yarn add react-native-sherpa-onnx-offline-stt
```

### Android

Add to your `android/app/build.gradle`:

```gradle
android {
    packagingOptions {
        pickFirst '**/*.so'
    }
}

dependencies {
    implementation 'com.k2fsa.sherpa:sherpa-onnx:1.10.+'
}
```

### iOS

```bash
cd ios && pod install
```

## Models

You need to download the models separately and place them on the device:

### STT Models

- **Streaming**: [Zipformer French](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-fr-2023-04-14.tar.bz2) (~128MB)
- **Offline**: [Parakeet TDT v3](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2) (~670MB)

### VAD Model

- [TEN-VAD](https://github.com/ten-framework/TEN-VAD) - Included in the library

### Speaker Diarization (Optional)

- [3D-Speaker](https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx) (~26MB)

### Denoiser (Optional)
- [GTCRN](https://github.com/k2-fsa/sherpa-onnx/releases/download/speech-enhancement-models/gtcrn_simple.onnx) (~524KB)

## Usage

```typescript
import STTManager from 'react-native-sherpa-onnx-offline-stt';
import type { STTResult, VADEvent, SpeakerEvent } from 'react-native-sherpa-onnx-offline-stt';

// Create manager instance
const sttManager = new STTManager();

// Initialize with configuration
await sttManager.initialize({
  modelPath: '/path/to/stt-model',
  tokensPath: '/path/to/tokens.txt',
  modelType: 'offline', // or 'streaming'
  vadModelPath: '/path/to/vad-model',
  sampleRate: 16000,

  // Structured VAD configuration
  vad: {
    threshold: 0.5,
    minSpeechDurationMs: 300,
    minSilenceDurationMs: 500,
    maxSpeechDurationMs: 30000, // Force segment break after 30s
    speechPaddingMs: 100,
    mode: 'normal', // 'aggressive' | 'normal' | 'sensitive'
  },

  // Optional features
  diarizationModelPath: '/path/to/speaker-model.onnx',
  diarizationThreshold: 0.55,
  denoiserModelPath: '/path/to/gtcrn_simple.onnx',
});

// Subscribe to events using chainable API
sttManager
  .on('transcript', (result: STTResult) => {
    console.log(`[Speaker ${result.speakerId}]: ${result.text}`);
    console.log(`RTFx: ${result.rtfx}, Processing: ${result.processingTime}s`);
  })
  .on('streaming', (update) => {
    // Two-tier streaming state
    console.log('Confirmed:', update.confirmed); // Stable text
    console.log('Volatile:', update.volatile); // May change
  })
  .on('vad', (event: VADEvent) => {
    console.log(`VAD: ${event.state}`);
  })
  .on('speaker', (event: SpeakerEvent) => {
    console.log(`Speaker ${event.speakerId} (${event.status})`);
  })
  .on('error', (error) => {
    console.error(`Error: ${error.code} - ${error.message}`);
  });

// Start recording
await sttManager.startRecording();

// Stop recording
const results = await sttManager.stopRecording();

// Clean up
await sttManager.deinitialize();
```

## API Reference

### STTManager Class

```typescript
const manager = new STTManager();
```

#### Properties

| Property | Type | Description |
|----------|------|-------------|
| `initialized` | `boolean` | Whether the engine is initialized |
| `recording` | `boolean` | Whether currently recording |

#### Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `initialize(config)` | `Promise<void>` | Initialize STT engine |
| `startRecording()` | `Promise<void>` | Start microphone recording |
| `stopRecording()` | `Promise<STTResult[]>` | Stop and get final results |
| `recognizeFile(path)` | `Promise<STTResult[]>` | Transcribe audio file |
| `isRecordingAsync()` | `Promise<boolean>` | Check recording status |
| `getModelType()` | `Promise<ModelType>` | Get current mode |
| `getSpeakerCount()` | `Promise<number>` | Get detected speakers |
| `resetSpeakers()` | `Promise<void>` | Clear speaker profiles |
| `setDenoiserEnabled(bool)` | `Promise<boolean>` | Toggle denoiser |
| `isDenoiserEnabled()` | `Promise<boolean>` | Check denoiser status |
| `startBackgroundService()` | `Promise<boolean>` | Enable background recording |
| `stopBackgroundService()` | `Promise<boolean>` | Disable background recording |
| `deinitialize()` | `Promise<void>` | Clean up resources |
| `on(event, callback)` | `this` | Subscribe to events (chainable) |
| `off(event)` | `this` | Unsubscribe from events (chainable) |

#### Static Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `STTManager.getAvailableProviders()` | `Promise<DeviceProvidersInfo>` | Get available ONNX providers |
| `STTManager.platform` | `string` | Current platform ('ios' or 'android') |

### Configuration

```typescript
interface STTConfig {
  // Required
  modelPath: string;
  tokensPath: string;
  vadModelPath: string;

  // STT mode
  modelType?: 'streaming' | 'offline'; // Default: 'streaming'

  // VAD configuration
  vad: {
    threshold: number;            // 0.5 - Speech detection sensitivity
    minSpeechDurationMs: number;  // 300 - Min speech to trigger
    minSilenceDurationMs: number; // 500 - Silence to end segment
    maxSpeechDurationMs:
      number;                     // 30000 - Force break long speech
    speechPaddingMs: number;      // 100 - Padding around segments
    mode: 'aggressive' | 'normal' | 'sensitive';
  };

  // Audio
  sampleRate?: number; // Default: 16000

  // Speaker diarization (optional)
  diarizationModelPath?: string;
  diarizationThreshold?: number;   // Default: 0.45
  diarizationMinSpeechMs?: number; // Default: 800

  // Denoiser (optional)
  denoiserModelPath?: string;

  // ONNX provider
  provider?: 'cpu' | 'nnapi' | 'gpu' | 'coreml'; // Default: 'cpu'
}
```

### Events

#### transcript

```typescript
interface STTResult {
  text: string;
  isFinal: boolean;
  startTime: number;
  endTime: number;

  // Performance metrics
  confidence: number;     // 0-1 recognition confidence
  processingTime: number; // Seconds to process
  audioDuration: number;  // Audio length in seconds
  rtfx: number;           // Real-time factor (>1 = faster than real-time)

  // Speaker info
  speakerId?: number;
  speakerStatus?: 'pending' | 'confirmed';
}
```

#### streaming

Two-tier transcript state for smoother UX:

```typescript
interface StreamingTranscriptUpdate {
  volatile: string;  // Current hypothesis (may change)
  confirmed: string; // Stable text (won't change)
  fullText: string;  // confirmed + volatile
  isFinal: boolean;
  confidence: number;
  processingTime: number;
  rtfx: number;
}
```

#### vad

```typescript
interface VADEvent {
  state: 'silence' | 'speech_start' | 'speech' | 'speech_end';
  speechProbability: number;
  speechDurationMs: number;
  silenceDurationMs: number;
}
```

#### speaker

```typescript
interface SpeakerEvent {
  speakerId: number;
  status: 'pending' | 'confirmed';
  justConfirmed: boolean;
  totalSpeakers: number;
}
```

#### error

```typescript
interface STTError {
  code: string;
  message: string;
}
```

### VAD Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `aggressive` | Less sensitive, fewer false positives | Noisy environments |
| `normal` | Balanced sensitivity | General use |
| `sensitive` | More sensitive, catches quieter speech | Quiet environments |
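The two-tier streaming state is meant to be merged into a single display string in your UI. A minimal sketch of one way to do that, using plain TypeScript with no native calls; the `renderCaption` helper and the bracket styling for the volatile tail are illustrative choices, not part of the library:

```typescript
// Subset of the StreamingTranscriptUpdate payload used here.
interface StreamingUpdate {
  confirmed: string; // stable text, will not change
  volatile: string;  // current hypothesis, may be revised
}

// Build the caption text: the stable part, plus the volatile tail
// marked in brackets so the user can see it may still change.
function renderCaption(update: StreamingUpdate): string {
  if (!update.volatile) return update.confirmed;
  return update.confirmed
    ? `${update.confirmed} [${update.volatile}]`
    : `[${update.volatile}]`;
}

console.log(renderCaption({ confirmed: 'hello world', volatile: 'this is' }));
// hello world [this is]
```

In a React Native component you would call something like `renderCaption` from the `streaming` handler and write the result into state, so only the bracketed tail flickers as hypotheses are revised.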
## Streaming vs Offline Mode

| Feature | Streaming | Offline |
|---------|-----------|---------|
| Latency | Real-time partial results | Results after speech ends |
| Accuracy | Good | Better |
| Use case | Live captions | Meeting transcription |
| Models | Zipformer | Parakeet, Whisper |

## Background Recording

To record while the app is in the background:

```typescript
// Before starting recording
await sttManager.startBackgroundService();
await sttManager.startRecording();

// When done
await sttManager.stopRecording();
await sttManager.stopBackgroundService();
```

This shows a notification while recording in the background.

## Provider Detection

Check available ONNX providers on the device:

```typescript
const info = await STTManager.getAvailableProviders();
console.log(`Device: ${info.manufacturer} ${info.device}`);
console.log(`Recommended: ${info.recommended}`);
console.log('Available:', info.providers.filter(p => p.available).map(p => p.name));
```

## Platform Support

| Platform | Status |
|----------|--------|
| Android | Full support |
| iOS | Full support |

## License

MIT

## Credits

- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) - Speech recognition engine
- [TEN-VAD](https://github.com/ten-framework/TEN-VAD) - Voice activity detection
- [GTCRN](https://github.com/Xiaobin-Rong/gtcrn) - Speech enhancement
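As a closing example, the provider info from `getAvailableProviders()` can be reduced to a config value with a small pure helper. This is a sketch assuming the `recommended`/`providers` fields shown under Provider Detection; the `pickProvider` function is not part of the library, and falling back to `'cpu'` assumes the CPU provider is always available (it is the library's documented default):

```typescript
// Assumed subset of the DeviceProvidersInfo shape.
interface ProviderInfo { name: string; available: boolean; }
interface ProvidersInfo { recommended: string; providers: ProviderInfo[]; }

// Use the recommended provider only if the device actually reports it
// as available; otherwise fall back to the 'cpu' default.
function pickProvider(info: ProvidersInfo): string {
  const ok = info.providers.some(p => p.name === info.recommended && p.available);
  return ok ? info.recommended : 'cpu';
}

console.log(pickProvider({
  recommended: 'nnapi',
  providers: [
    { name: 'cpu', available: true },
    { name: 'nnapi', available: false },
  ],
}));
// cpu
```

The returned string can then be passed as the `provider` field of `STTConfig` before calling `initialize`.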