<img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/banner-white.png" alt="Logo">

<b>Speech To Element</b> is an all-purpose [npm](https://www.npmjs.com/package/speech-to-element) library that can transcribe speech into text right out of the box! Try it out on the [official website](https://speechtoelement.com).

### :zap: Services

- [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API)
- [Azure Cognitive Speech Services API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text)

https://github.com/OvidijusParsiunas/speech-to-element/assets/18709577/e2e618f8-b61c-4877-804b-26eeefbb0afa

### :computer: How to use

[NPM](https://www.npmjs.com/package/speech-to-element):

```
npm install speech-to-element
```

```
import SpeechToElement from 'speech-to-element';

const targetElement = document.getElementById('target-element');
SpeechToElement.toggle('webspeech', {element: targetElement});
```

[CDN](https://cdn.jsdelivr.net/gh/ovidijusparsiunas/speech-to-element@master/component/bundle/index.min.js):

```
<script type="module" src="https://cdn.jsdelivr.net/gh/ovidijusparsiunas/speech-to-element@master/component/bundle/index.min.js"></script>
```

```
const targetElement = document.getElementById('target-element');
window.SpeechToElement.toggle('webspeech', {element: targetElement});
```

When using Azure, you will also need to install its speech [SDK](https://www.npmjs.com/package/microsoft-cognitiveservices-speech-sdk). Read more in the [Azure SDK](#floppy_disk-azure-sdk) section.

<br />

Make sure to check out the [examples](https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples) directory to browse templates for [React](https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/ui), [Next.js](https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/nextjs) and more.
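To see the CDN approach end to end, the snippets above can be combined into a single standalone page. A minimal sketch - the contenteditable div and the toggle button ids are illustrative assumptions:

```
<!DOCTYPE html>
<html>
  <body>
    <!-- the element that receives the transcribed text -->
    <div id="target-element" contenteditable="true" style="border: 1px solid grey; min-height: 40px"></div>
    <button id="toggle-button">Toggle speech</button>

    <script type="module" src="https://cdn.jsdelivr.net/gh/ovidijusparsiunas/speech-to-element@master/component/bundle/index.min.js"></script>
    <script>
      // 'webspeech' needs no credentials, so it is the easiest service to try first
      document.getElementById('toggle-button').onclick = () => {
        const targetElement = document.getElementById('target-element');
        window.SpeechToElement.toggle('webspeech', {element: targetElement});
      };
    </script>
  </body>
</html>
```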
## :construction_worker: Local setup

```
# Install node dependencies:
$ npm install

# Serve the component locally (from index.html):
$ npm run start

# Build the component into a module (dist/index.js):
$ npm run build:module
```

### :beginner: API

#### Methods

Used to control Speech To Element transcription:

| Name | Description |
| :--- | :--- |
| startWebSpeech({[`Options`](#options) & [`WebSpeechOptions`](#webspeechoptions)}) | Start [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API) |
| startAzure({[`Options`](#options) & [`AzureOptions`](#azureoptions)}) | Start [Azure API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text) |
| toggle("webspeech", {[`Options`](#options) & [`WebSpeechOptions`](#webspeechoptions)}) | Start/Stop [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API) |
| toggle("azure", {[`Options`](#options) & [`AzureOptions`](#azureoptions)}) | Start/Stop [Azure API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text) |
| stop() | Stops all speech services |
| endCommandMode() | Ends the [`command`](#commands) mode |

Examples:

```
SpeechToElement.startWebSpeech({element: targetElement, displayInterimResults: false});

SpeechToElement.startAzure({element: targetElement, region: 'westus', token: 'token'});

SpeechToElement.toggle('webspeech', {element: targetElement, language: 'en-US'});

SpeechToElement.toggle('azure', {element: targetElement, region: 'eastus', subscriptionKey: 'key'});

SpeechToElement.stop();

SpeechToElement.endCommandMode();
```

#### Object Types

##### Options:

Generic options for the speech to element functionality:

| Name | Type | Description |
| :--- | :--- | :--- |
| element | `Element \| Element[]` | Transcription target element. By defining multiple elements inside an array, the user can switch between them in the same session by clicking on them. |
| autoScroll | `boolean` | Controls if the element automatically scrolls to the new text. |
| displayInterimResults | `boolean` | Controls if interim results are displayed. |
| textColor | [`TextColor`](#textcolor) | Object defining the result text colors. |
| translations | `{[key: string]: string}` | Case-sensitive one-to-one map of words that will automatically be translated to others. |
| commands | [`Commands`](#commands) | Sets the phrases that will trigger various chat functionality. |
| onStart | `() => void` | Triggered when speech recording has started. |
| onStop | `() => void` | Triggered when speech recording has stopped. |
| onResult | `(text: string, isFinal: boolean) => void` | Triggered when a new result is transcribed and inserted into the element. |
| onPreResult | `(text: string, isFinal: boolean) => `[`PreResult`](#preresult)` \| void` | Triggered before result text insertion. This function can be used to control the speech service based on what was spoken via the [PreResult](#preresult) object. |
| onCommandModeTrigger | `(isStart: boolean) => void` | Triggered when command mode is initiated and stopped. |
| onPauseTrigger | `(isStart: boolean) => void` | Triggered when the pause command is initiated and stopped via the resume command. |
| onError | `(message: string) => void` | Triggered when an error has occurred. |

Examples:

```
SpeechToElement.toggle('webspeech', {element: targetElement, translations: {hi: 'bye', Hi: 'Bye'}});

SpeechToElement.toggle('webspeech', {onResult: (text) => console.log(text)});
```
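Since `element` also accepts an array, here is a small sketch of the multi-element form - the two contenteditable elements and their ids are assumptions for illustration. Clicking either element mid-session switches the transcription target to it:

```
const notes = document.getElementById('notes');
const summary = document.getElementById('summary');

// passing both elements lets the user click either one during
// the session to choose where the transcribed text is inserted
SpeechToElement.toggle('webspeech', {element: [notes, summary]});
```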
##### TextColor:

Object used to set the color for transcription result text (does not work for `input` and `textarea` elements):

| Name | Type | Description |
| :--- | :--- | :--- |
| interim | `string` | Temporary text color |
| final | `string` | Final text color |

Example:

```
SpeechToElement.toggle('webspeech', {
  element: targetElement,
  textColor: {interim: 'grey', final: 'black'}
});
```

##### Commands:

https://github.com/OvidijusParsiunas/speech-to-element/assets/18709577/cca6bc40-ceb7-4d48-92e4-31c5f66366eb

Object used to set the phrases of commands that will control transcription and input functionality:

| Name | Type | Description |
| :--- | :--- | :--- |
| stop | `string` | Stops the speech service. |
| pause | `string` | Temporarily stops the transcription and re-enables it after the phrase for `resume` is spoken. |
| resume | `string` | Re-enables transcription after it has been stopped by the `pause` or `commandMode` commands. |
| reset | `string` | Removes the transcribed text (since the last element cursor move). |
| removeAllText | `string` | Removes all element text. |
| commandMode | `string` | Activates the command mode, which stops the transcription and waits for a command to be executed. Use the phrase for `resume` to leave the command mode. |
| settings | [`CommandSettings`](#commandsettings) | Controls how command mode is used. |

Example:

```
SpeechToElement.toggle('webspeech', {
  element: targetElement,
  commands: {
    pause: 'pause',
    resume: 'resume',
    removeAllText: 'remove text',
    commandMode: 'command'
  }
});
```
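The command phrases pair naturally with the `onCommandModeTrigger` and `onPauseTrigger` callbacks from [`Options`](#options). A sketch of surfacing the current mode in the UI - the `status` element is an assumption for illustration:

```
const status = document.getElementById('status');

SpeechToElement.toggle('webspeech', {
  element: targetElement,
  commands: {pause: 'pause', resume: 'resume', commandMode: 'command'},
  // reflect the current mode as the commands are spoken
  onCommandModeTrigger: (isStart) => {
    status.textContent = isStart ? 'awaiting command...' : 'transcribing';
  },
  onPauseTrigger: (isStart) => {
    status.textContent = isStart ? 'paused' : 'transcribing';
  }
});
```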
##### CommandSettings:

Object used to configure how the command phrases are interpreted:

| Name | Type | Description |
| :--- | :--- | :--- |
| substrings | `boolean` | Toggles whether command phrases can be part of spoken words or must be whole words. E.g. when this is set to _true_ and your command phrase is _"stop"_ - saying "stopping" will execute the command. However, if it is set to _false_ - the command will only be executed if you say "stop". |
| caseSensitive | `boolean` | Toggles if command phrases are case-sensitive. E.g. if this is set to _true_ and your command phrase is _"stop"_ - when the service recognizes your speech as "Stop" it will not execute the command. On the other hand, if it is set to _false_, it will. |

Example:

```
SpeechToElement.toggle('webspeech', {
  element: targetElement,
  commands: {
    removeAllText: 'remove text',
    settings: {
      substrings: true,
      caseSensitive: false
    }
  }
});
```

##### PreResult:

Result object for the `onPreResult` function. This can be used to control the speech service and facilitate custom commands for your application:

| Name | Type | Description |
| :--- | :--- | :--- |
| stop | `boolean` | Stops the speech service. |
| restart | `boolean` | Restarts the speech service. |
| removeNewText | `boolean` | Toggles whether the newly spoken (interim) text is removed when either of the above properties is set to `true`. |

Example of creating a custom command:

```
SpeechToElement.toggle('webspeech', {
  element: targetElement,
  onPreResult: (text) => {
    if (text.toLowerCase().includes('custom command')) {
      SpeechToElement.endCommandMode();
      // your custom code here
      return {restart: true, removeNewText: true};
    }
  }
});
```
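A variant sketch using the `stop` property instead - ending the session when a closing phrase is heard, without inserting the phrase itself into the element. The phrase is an assumption for illustration:

```
SpeechToElement.toggle('webspeech', {
  element: targetElement,
  onPreResult: (text) => {
    // 'send message' is an assumed phrase - use whatever suits your app
    if (text.toLowerCase().includes('send message')) {
      return {stop: true, removeNewText: true};
    }
  }
});
```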
##### WebSpeechOptions:

Custom options for the [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API):

| Name | Type | Description |
| :--- | :--- | :--- |
| language | `string` | The recognition language. See the following [`QA`](https://stackoverflow.com/questions/23733537/what-are-the-supported-languages-for-web-speech-api-in-html5) for the full list. |

Example:

```
SpeechToElement.toggle('webspeech', {element: targetElement, language: 'en-GB'});
```

##### AzureOptions:

Options for the [Azure Cognitive Speech Services API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text). This object REQUIRES `region` and either the `retrieveToken`, `subscriptionKey` or `token` property to be defined with it:

| Name | Type | Description |
| :--- | :--- | :--- |
| region | `string` | Location/region of your Azure speech resource. |
| retrieveToken | `() => Promise<string>` | Function used to retrieve a new token for your Azure speech resource. It is the recommended property to use as it can retrieve the token from a secure server that will hide your credentials. Check out the [starter server templates](https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples) to start a local server in seconds. |
| subscriptionKey | `string` | Subscription key for your Azure speech resource. |
| token | `string` | Temporary token for the Azure speech resource. |
| language | `string` | BCP-47 string value to denote the recognition language. You can find the full list [here](https://docs.microsoft.com/azure/cognitive-services/speech-service/supported-languages). |
| autoLanguage | [`AutoLanguage`](#autolanguage) | Automatically identify the spoken language based on a provided list. |
| endpointId | `string` | Endpoint ID of a customized speech model. |
| deviceId | `string` | ID of a specific media device. More info [here](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-select-audio-input-devices#audio-device-ids-in-javascript). |
| stopAfterSilenceMs | `number` | Milliseconds of silence required for the speech service to automatically stop. Default is 25000ms (25 seconds). |

Examples:

```
SpeechToElement.toggle('azure', {
  element: targetElement,
  region: 'eastus',
  token: 'token',
  language: 'ja-JP'
});

SpeechToElement.toggle('azure', {
  element: targetElement,
  region: 'southeastasia',
  retrieveToken: async () => {
    return fetch('http://localhost:8080/token')
      .then((res) => res.text())
      .catch((error) => console.error(error));
  }
});
```

<br />

##### AutoLanguage:

Object used to configure automatic [language identification](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-identification) based on a list of candidate `languages`:

| Name | Type | Description |
| :--- | :--- | :--- |
| languages | `string[]` | An array of candidate languages that will be present in the audio. See available languages [here](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=language-identification). At least 1 language is required. When using `AtStart`, the maximum number of languages is 4; when using `Continuous`, the maximum is 10. |
| type | `'AtStart' \| 'Continuous'` | Optional property that defines if the language is identified in the first 5 seconds and does not change (`AtStart`), or if there can be multiple languages throughout the speech (`Continuous`). `AtStart` is set by default. |
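For illustration, a short sketch of `autoLanguage` in use - the candidate list and the credential values are placeholder assumptions:

```
SpeechToElement.toggle('azure', {
  element: targetElement,
  region: 'eastus',
  token: 'token',
  // identify the spoken language once at the start of the session
  autoLanguage: {languages: ['en-US', 'ja-JP'], type: 'AtStart'}
});
```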
<br />

Example server templates for the `retrieveToken` property:

| Express | Nest | Flask | Spring | Go | Next |
| --- | --- | --- | --- | --- | --- |
| <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/node/express" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/expressLogo.png" width="60"/></a> | <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/node/nestjs" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/nestLogo.png" width="60"/></a> | <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/python/flask" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/flaskLogo.png" width="60"/></a> | <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/java/springboot" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/springBootLogo.png" width="50"/></a> | <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/go" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/goLogo.png" width="40"/></a> | <a href="https://github.com/OvidijusParsiunas/speech-to-element/tree/main/examples/nextjs" target="_blank"><img src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/nextLogo.png" width="55"/></a> |
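To give a feel for what these templates do, here is a minimal sketch of an Express token endpoint. It assumes your credentials live in hypothetical `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables and that you are on Node 18+ (for the built-in `fetch`); the linked templates are the authoritative versions:

```
const express = require('express');
const cors = require('cors');

const app = express();
app.use(cors()); // allow the browser page to call this endpoint

// exchanges the subscription key for a short-lived token so that
// the key itself never reaches the browser
app.get('/token', async (req, res) => {
  const region = process.env.AZURE_SPEECH_REGION;
  const response = await fetch(`https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`, {
    method: 'POST',
    headers: {'Ocp-Apim-Subscription-Key': process.env.AZURE_SPEECH_KEY}
  });
  res.send(await response.text());
});

app.listen(8080);
```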
<br />

Location of `subscriptionKey` and `region` details in the Azure Portal:

<img width="987" src="https://raw.githubusercontent.com/OvidijusParsiunas/speech-to-element/HEAD/assets/azure-credentials.png" alt="Credentials location in Azure Portal">

<br />

### :floppy_disk: Azure SDK

To use the [Azure Cognitive Speech Services API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text), you will need to add the official [Azure Speech SDK](https://www.npmjs.com/package/microsoft-cognitiveservices-speech-sdk) to your project and assign it to the `window.SpeechSDK` variable. Here are some simple ways you can achieve this:

- <b>Import from a dependency:</b>

  If you are using a dependency manager, import the SDK and assign it to window.SpeechSDK:

  ```
  import * as sdk from 'microsoft-cognitiveservices-speech-sdk';

  window.SpeechSDK = sdk;
  ```

- <b>Dynamic import from a dependency:</b>

  If you are using a dependency manager, dynamically import the SDK and assign it to window.SpeechSDK:

  ```
  import('microsoft-cognitiveservices-speech-sdk').then((module) => {
    window.SpeechSDK = module;
  });
  ```

- <b>Script from a CDN:</b>

  You can add a script tag to your markup or create one via JavaScript. The window.SpeechSDK property will be populated automatically:

  ```
  <script src="https://aka.ms/csspeech/jsbrowserpackageraw"></script>
  ```

  ```
  const script = document.createElement('script');
  script.src = 'https://aka.ms/csspeech/jsbrowserpackageraw';
  document.body.appendChild(script);
  ```

If your project is using `TypeScript`, add this to the file where the module is used:

```
import * as sdk from 'microsoft-cognitiveservices-speech-sdk';

declare global {
  interface Window {
    SpeechSDK: typeof sdk;
  }
}
```

Examples:

Example React project that uses a package bundler. It should work similarly for other UI frameworks: [Click for Live Example](https://stackblitz.com/edit/stackblitz-starters-ujkq7j?file=src%2FApp.tsx)

VanillaJS approach with no bundler (this can also be used as a fallback if the above doesn't work): [Click for Live Example](https://codesandbox.io/s/speech-to-element-azure-vanillajs-gvj9v4?file=/index.html)

## :star: Example Product

[Deep Chat](https://deepchat.dev/) - an AI-oriented chat component that uses Speech To Element to power its Speech To Text capabilities.

## :heart: Contributions

Open source is built by the community for the community. All contributions to this project are welcome!<br>
Additionally, if you have any suggestions for enhancements, ideas on how to take the project further or have discovered a bug, do not hesitate to create a new issue ticket and we will look into it as soon as possible!