echogarden

Version:

An easy-to-use speech toolset. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.

github.com/echogarden-project/echogarden

echogarden-project/echogarden

412 lines (282 loc) • 10.1 kB

Markdown

# Node.js API reference **Note**: the API is not fully stable yet. It may change at every new version. There are many methods, types and internal data structures that are not yet exposed. ### Importing as a Node.js module To import the `echograden` package as a Node.js module: Install as a dependency in your project: ```bash npm install echogarden ``` Import with: ```ts import * as Echogarden from 'echogarden' ``` All methods, properties and arguments have TypeScript type information. You can use it to get more detailed and up-to-date type information that may not be covered here. ### Related pages * [Options reference](Options.md) * [List of all supported engines](Engines.md) * [Quick guide to the command line interface](CLI.md) * [WebSocket server reference](Server.md) ## Text-to-speech ### `synthesize(input, options, onSegment, onSentence)` Synthesizes the given input. * `input`: text to synthesize, can be a `string`, or a `string[]`. When given an array of strings, the elements of the array would be seen as predefined segments (this is useful if you would like to have more control over how segments are split, or your input has a special format requiring a custom splitting method). * `options`: synthesis options object * `onSegment`: a callback that is called whenever a segment has been synthesized (optional) * `onSentence`: a callback that is called whenever a sentence has been synthesized (optional) #### Returns (via promise): ```ts { audio: RawAudio | Buffer timeline: Timeline language: string } ``` `audio` may either be a * `RawAudio` object, which is a structure containing the sample rate and raw 32-bit float channels: ```ts { sampleRate: number channels: Float32Array[] } ``` * A `Uint8Array` containing the audio in encoded form, in the case a particular codec was specified in the `outputAudioFormat.codec` option. #### Segment and sentence event callbacks You can optionally pass two `async` callbacks to `synthesize`, `onSegment` and `onSentence`. For example: ```ts async function onSegment(data: SynthesisSegmentEventData) { console.log(data.transcript) } const { audio } = await Echogarden.synthesize("Hello World!", { engine: 'espeak' }, onSegment) ``` `SynthesisSegmentEventData` is an object with the structure: ```ts { index: number // Index of part total: number // Total number of parts audio: RawAudio | Buffer // Audio for part timeline: Timeline // Timeline for part transcript: string // Transcript for part language: string // Language for part peakDecibelsSoFar: number // Peak decibels measured for all synthesized audio, so far } ``` ### `requestVoiceList(options)` Requests a list of voices for a particular engine. * `options`: voice list request options object #### Returns (via promise): ```ts { voiceList: SynthesisVoice[] bestMatchingVoice: SynthesisVoice } ``` ## Speech-to-text ### `recognize(input, options)` Applies speech recognition to the input. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: recognition options object #### Returns (via promise): ```ts { transcript: string timeline: Timeline wordTimeline: Timeline language: string inputRawAudio: RawAudio isolatedRawAudio?: RawAudio backgroundRawAudio?: RawAudio } ``` ## Speech-to-transcript alignment ### `align(input, transcript, options)` Aligns input audio with the given transcript. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `transcript`: the transcript to align to * `options`: alignment options object #### Returns (via promise): ```ts { timeline: Timeline wordTimeline: Timeline transcript: string language: string inputRawAudio: RawAudio isolatedRawAudio?: RawAudio backgroundRawAudio?: RawAudio } ``` ## Speech-to-text translation ### `translateSpeech(input, options)` Translates speech audio directly to a transcript in a different language (only English is currently supported). * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: speech translation options object #### Returns (via promise): ```ts { transcript: string timeline: Timeline wordTimeline?: Timeline sourceLanguage: string targetLanguage: string inputRawAudio: RawAudio isolatedRawAudio?: RawAudio backgroundRawAudio?: RawAudio } ``` ## Text-to-text translation ### `translateText(input, options)` Translates text to text. * `input`: string * `options`: text translation options object #### Returns (via promise): ```ts { text: string translatedText: string translationPairs: TranslationPair[] sourceLanguage: string targetLanguage: string } ``` `translationPairs` is an array of objects corresponding to individual segments of the text and their translations. ## Speech-to-translated-transcript alignment ### `alignTranslation(input, translatedTranscript, options)` Aligns input audio with the given translated transcript. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `translatedTranscript`: the translated transcript to align to * `options`: translation alignment options object #### Returns (via promise): ```ts { timeline: Timeline wordTimeline: Timeline translatedTranscript: string sourceLanguage: string targetLanguage: string inputRawAudio: RawAudio isolatedRawAudio?: RawAudio backgroundRawAudio?: RawAudio } ``` ### `alignTranscriptAndTranslation(input, transcript, translatedTranscript, options)` Aligns input audio to both the native language transcript a translated one. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `transcript`: the transcript to align to, in the native speech language * `translatedTranscript`: the translated transcript to align to * `options`: transcript and translation alignment options object #### Returns (via promise): ```ts { timeline: Timeline wordTimeline: Timeline translatedTimeline: Timeline translatedWordTimeline: Timeline transcript: string translatedTranscript: string sourceLanguage: string targetLanguage: string inputRawAudio: RawAudio isolatedRawAudio?: RawAudio backgroundRawAudio?: RawAudio } ``` ### `alignTimelineTranslation(inputTimeline, translatedTranscript, options)` Aligns given timeline with its translated transcript. * `inputTimeline`: input timeline in the native language * `translatedTranscript`: the translated transcript to align to * `options`: timeline translation alignment options object #### Returns (via promise): ```ts { timeline: Timeline wordTimeline: Timeline sourceLanguage?: string targetLanguage: string rawAudio?: RawAudio } ``` ## Language detection ### `detectSpeechLanguage(input, options)` Detects language of spoken audio. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: speech language detection options object #### Returns (via promise): ```ts { detectedLanguage: string detectedLanguageName: string detectedLanguageProbabilities: LanguageDetectionResults } ``` ### `detectTextLanguage(input, options)` Detects language of text. * `input`: input text as `string` * `options`: text language detection options object #### Returns (via promise): ```ts { detectedLanguage: string detectedLanguageName: string detectedLanguageProbabilities: LanguageDetectionResults } ``` ## Voice activity detection ### `detectVoiceActivity(input, options)` Detects voice activity in audio (non-real-time). * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: voice activity detection options object #### Returns (via promise): ```ts { timeline: Timeline } ``` ## Speech denoising ### `denoise(input, options)` Tries to reduce background noise in spoken audio. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: denoising options object #### Returns (via promise): ```ts { denoisedAudio: RawAudio } ``` ## Source separation ### `isolate(input, options)` Attempts to isolate an individual [audio stem](https://en.wikipedia.org/wiki/Stem_(audio)), like human voice, or one or more musical instruments (depending on model training), from the given waveform. * `input`: can be an audio file path (`string`), encoded audio (`Buffer` or `Uint8array`) or a raw audio object (`RawAudio`) * `options`: source separation options object #### Returns (via promise): ```ts { inputRawAudio: RawAudio isolatedRawAudio: RawAudio backgroundRawAudio: RawAudio } ``` ## Subtitles ### `timelineToSubtitles(timeline, options)` Converts a timeline to subtitles. * `timeline`: timeline object * `options`: subtitles configuration object #### Returns: Subtitle file content, as a string. ### `subtitlesToTimeline(subtitles)` Converts subtitles to a timeline. * `subtitles`: timeline object **Note**: This function simply converts each individual cue to a segment entry in a timeline. Since subtitle cues may contain parts of sentences or phrases, this may not produce very useful results for your needs. However, you can use it as a means to parse a subtitle file (`srt` or `vtt`), and apply your own processing later. #### Returns: Timeline object. ## Global options ### `setGlobalOption(key, value)` Sets a global option. See the [options reference](Options.md) for more details about the available global options. ### `getGlobalOption(key)` Gets a global option. #### Returns: The value associated with the given key. ## TODO * Expose more methods that may be useful for developers, like phonemization, etc. * Expose audio playback used in CLI, possibly with timeline synchronization support.