@mastra/voice-cloudflare
Mastra Cloudflare AI voice integration
# Voice
Mastra agents can be enhanced with voice capabilities, allowing them to speak responses and listen to user input. You can configure an agent to use either a single voice provider or combine multiple providers for different operations.
## Basic usage
The simplest way to add voice to an agent is to use a single provider for both speaking and listening:
```typescript
import { Agent } from '@mastra/core/agent'
import { playAudio } from '@mastra/node-audio'
import { OpenAIVoice } from '@mastra/voice-openai'

// Initialize the voice provider with default settings
const voice = new OpenAIVoice()

// Create an agent with voice capabilities
export const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: 'openai/gpt-5.4',
  voice,
})

// The agent can now use voice for interaction
const audioStream = await agent.voice.speak("Hello, I'm your AI assistant!", {
  filetype: 'm4a',
})

playAudio(audioStream!)

try {
  const transcription = await agent.voice.listen(audioStream)
  console.log(transcription)
} catch (error) {
  console.error('Error transcribing audio:', error)
}
```
## Working with audio streams
The `speak()` and `listen()` methods work with Node.js streams. Here's how to save and load audio files:
### Saving Speech Output
The `speak` method returns a stream that you can pipe to a file or speaker.
```typescript
import { createWriteStream } from 'fs'
import path from 'path'
// Generate speech and save to file
const audio = await agent.voice.speak('Hello, World!')
const filePath = path.join(process.cwd(), 'agent.mp3')
const writer = createWriteStream(filePath)
audio.pipe(writer)
// Wait until the file has been fully written
await new Promise<void>((resolve, reject) => {
  writer.on('finish', () => resolve())
  writer.on('error', reject)
})
```
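As an alternative to wiring `finish` and `error` handlers by hand, Node's `pipeline` from `node:stream/promises` resolves once the data is flushed and rejects if either stream fails. The sketch below is only an illustration: `saveStream` and `demo` are hypothetical names, and the in-memory `Readable` stands in for the stream returned by `speak()`.

```typescript
import { createWriteStream } from 'node:fs'
import { readFile } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import { join } from 'node:path'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'

// Hypothetical helper (not a Mastra API): writes a readable stream to
// `filePath`, resolving when the write completes and rejecting on any error.
async function saveStream(audio: NodeJS.ReadableStream, filePath: string): Promise<void> {
  await pipeline(audio, createWriteStream(filePath))
}

// Exercises the helper with a fake audio stream and returns the number of
// bytes that ended up on disk.
async function demo(): Promise<number> {
  const fakeAudio = Readable.from([Buffer.from([1, 2, 3]), Buffer.from([4, 5])])
  const filePath = join(tmpdir(), 'agent-demo.mp3')
  await saveStream(fakeAudio, filePath)
  return (await readFile(filePath)).length
}
```

With a real agent, the call would look like `await saveStream(audio, filePath)` in place of the manual `Promise` wrapper.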
### Transcribing Audio Input
The `listen` method expects a stream of audio data from a microphone or file.
```typescript
import { createReadStream } from 'fs'
import path from 'path'
// Read audio file and transcribe
const audioFilePath = path.join(process.cwd(), 'agent.m4a')
const audioStream = createReadStream(audioFilePath)

try {
  console.log('Transcribing audio file...')
  const transcription = await agent.voice.listen(audioStream, {
    filetype: 'm4a',
  })
  console.log('Transcription:', transcription)
} catch (error) {
  console.error('Error transcribing audio:', error)
}
```
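When the file name is not fixed, the `filetype` option can be derived from the file's extension instead of being hard-coded. This is a minimal sketch: `filetypeFromPath` is a hypothetical helper, and whether a given provider accepts the resulting value is an assumption to check against that provider's reference.

```typescript
import { extname } from 'node:path'

// Hypothetical helper: returns the lowercased extension without the dot,
// or undefined when the file name has no extension.
function filetypeFromPath(filePath: string): string | undefined {
  const ext = extname(filePath).slice(1).toLowerCase()
  return ext.length > 0 ? ext : undefined
}

console.log(filetypeFromPath('recordings/agent.M4A'))
```

The result could then be passed as `{ filetype: filetypeFromPath(audioFilePath) }` when calling `listen()`.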
## Speech-to-speech voice interactions
For more dynamic and interactive voice experiences, you can use real-time voice providers that support speech-to-speech capabilities:
```typescript
import { Agent } from '@mastra/core/agent'
import { getMicrophoneStream } from '@mastra/node-audio'
import { OpenAIRealtimeVoice } from '@mastra/voice-openai-realtime'
import { search, calculate } from '../tools'
// Initialize the realtime voice provider
const voice = new OpenAIRealtimeVoice({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-5.1-realtime',
  speaker: 'alloy',
})

// Create an agent with speech-to-speech voice capabilities
export const agent = new Agent({
  id: 'speech-to-speech-agent',
  name: 'Speech-to-Speech Agent',
  instructions: `You are a helpful assistant with speech-to-speech capabilities.`,
  model: 'openai/gpt-5.4',
  tools: {
    // Tools configured on Agent are passed to voice provider
    search,
    calculate,
  },
  voice,
})
// Establish a WebSocket connection
await agent.voice.connect()
// Start a conversation
agent.voice.speak("Hello, I'm your AI assistant!")
// Stream audio from a microphone
const microphoneStream = getMicrophoneStream()
agent.voice.send(microphoneStream)
// When done with the conversation
agent.voice.close()
```
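Because the connection stays open until `close()` is called, it can help to guarantee the close even when the conversation code throws. The following is a sketch under stated assumptions: `RealtimeVoiceLike` and `withVoiceSession` are hypothetical names that model only the `connect()` and `close()` calls shown above, not types exported by Mastra.

```typescript
// Minimal shape of the calls used in this sketch (an assumption for
// illustration, not a Mastra type).
interface RealtimeVoiceLike {
  connect(): Promise<void>
  close(): void
}

// Hypothetical helper: connects, runs `fn`, and always closes the
// connection afterwards, whether `fn` resolves or rejects.
async function withVoiceSession<V extends RealtimeVoiceLike, T>(
  voice: V,
  fn: (voice: V) => Promise<T>,
): Promise<T> {
  await voice.connect()
  try {
    return await fn(voice)
  } finally {
    voice.close()
  }
}
```

A conversation would then run inside the callback, for example `await withVoiceSession(agent.voice, async voice => { /* speak, send */ })`, assuming the provider matches the interface above.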
### Event System
The realtime voice provider emits several events you can listen for:
```typescript
// Listen for speech audio data sent from voice provider
agent.voice.on('speaking', ({ audio }) => {
  // audio contains ReadableStream or Int16Array audio data
})

// Listen for transcribed text sent from both voice provider and user
agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role} said: ${text}`)
})

// Listen for errors
agent.voice.on('error', error => {
  console.error('Voice error:', error)
})
```
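The `writing` event can be used to build a running transcript of the conversation. The sketch below assumes the `{ text, role }` payload shown above; `collectTranscript` is a hypothetical helper, and a plain Node `EventEmitter` stands in for `agent.voice` so the example runs without a live connection.

```typescript
import { EventEmitter } from 'node:events'

type WritingEvent = { text: string; role: string }

// Structural type covering only the `on('writing', ...)` call used here.
interface WritingSource {
  on(event: 'writing', listener: (event: WritingEvent) => void): unknown
}

// Hypothetical helper: appends one "role: text" line per 'writing' event
// and returns the array, which keeps growing as events arrive.
function collectTranscript(voice: WritingSource): string[] {
  const lines: string[] = []
  voice.on('writing', ({ text, role }) => {
    lines.push(`${role}: ${text}`)
  })
  return lines
}

// Stand-in emitter used only to exercise the helper.
const fakeVoice = new EventEmitter()
const transcript = collectTranscript(fakeVoice)
fakeVoice.emit('writing', { text: 'Hello', role: 'assistant' })
fakeVoice.emit('writing', { text: 'Hi there', role: 'user' })
console.log(transcript.join('\n'))
```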
## Examples
### End-to-end voice interaction
This example demonstrates a voice interaction between two agents. The hybrid voice agent, which uses `CompositeVoice` to configure its input and output providers separately, speaks a question that is saved as an audio file. The unified voice agent listens to that file, processes the question, generates a response, and speaks it back. Both audio outputs are saved to the `audio` directory.
The following files are created:
- **hybrid-question.mp3** – Hybrid agent's spoken question.
- **unified-response.mp3** – Unified agent's spoken response.
```typescript
import 'dotenv/config'
import path from 'path'
import fs, { createReadStream, createWriteStream } from 'fs'
import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { OpenAIVoice } from '@mastra/voice-openai'
import { Mastra } from '@mastra/core'
// Saves an audio stream to a file in the audio directory, creating the directory if it doesn't exist.
export const saveAudioToFile = async (
  audio: NodeJS.ReadableStream,
  filename: string,
): Promise<void> => {
  const audioDir = path.join(process.cwd(), 'audio')
  const filePath = path.join(audioDir, filename)

  await fs.promises.mkdir(audioDir, { recursive: true })

  const writer = createWriteStream(filePath)
  audio.pipe(writer)
  return new Promise((resolve, reject) => {
    writer.on('finish', resolve)
    writer.on('error', reject)
  })
}

// Returns a string input unchanged; otherwise reads the stream to the end and decodes it as UTF-8 text.
export const convertToText = async (input: string | NodeJS.ReadableStream): Promise<string> => {
  if (typeof input === 'string') {
    return input
  }

  const chunks: Buffer[] = []
  return new Promise((resolve, reject) => {
    input.on('data', chunk => chunks.push(Buffer.from(chunk)))
    input.on('error', reject)
    input.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')))
  })
}
export const hybridVoiceAgent = new Agent({
  id: 'hybrid-voice-agent',
  name: 'Hybrid Voice Agent',
  model: 'openai/gpt-5.4',
  instructions: 'You can speak and listen using different providers.',
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new OpenAIVoice(),
  }),
})

export const unifiedVoiceAgent = new Agent({
  id: 'unified-voice-agent',
  name: 'Unified Voice Agent',
  instructions: 'You are an agent with both STT and TTS capabilities.',
  model: 'openai/gpt-5.4',
  voice: new OpenAIVoice(),
})

export const mastra = new Mastra({
  agents: { hybridVoiceAgent, unifiedVoiceAgent },
})
// Look up the registered agents under new names to avoid redeclaring the exported constants
const hybridAgent = mastra.getAgent('hybridVoiceAgent')
const unifiedAgent = mastra.getAgent('unifiedVoiceAgent')

const question = 'What is the meaning of life in one sentence?'

// The hybrid agent speaks the question, which is saved to disk
const hybridSpoken = await hybridAgent.voice.speak(question)
await saveAudioToFile(hybridSpoken!, 'hybrid-question.mp3')

// The unified agent transcribes the saved question
const audioStream = createReadStream(path.join(process.cwd(), 'audio', 'hybrid-question.mp3'))
const unifiedHeard = await unifiedAgent.voice.listen(audioStream)
const inputText = await convertToText(unifiedHeard!)

// The unified agent answers and speaks its response
const unifiedResponse = await unifiedAgent.generate(inputText)
const unifiedSpoken = await unifiedAgent.voice.speak(unifiedResponse.text)
await saveAudioToFile(unifiedSpoken!, 'unified-response.mp3')
```
### Using Multiple Providers
For more flexibility, you can use different providers for speaking and listening using the CompositeVoice class:
```typescript
import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { OpenAIVoice } from '@mastra/voice-openai'
import { PlayAIVoice } from '@mastra/voice-playai'
export const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: 'openai/gpt-5.4',
  // Create a composite voice using OpenAI for listening and PlayAI for speaking
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new PlayAIVoice(),
  }),
})
```
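Conceptually, a composite voice routes `listen()` to the input provider and `speak()` to the output provider. The sketch below illustrates that delegation pattern only; `SimpleCompositeVoice`, `SpeechToText`, and `TextToSpeech` are hypothetical names, and this is not Mastra's actual `CompositeVoice` implementation.

```typescript
import { Readable } from 'node:stream'

// Hypothetical minimal provider interfaces; real provider types are richer.
interface SpeechToText {
  listen(audio: NodeJS.ReadableStream): Promise<string>
}
interface TextToSpeech {
  speak(text: string): Promise<NodeJS.ReadableStream>
}

// Illustrative only: forwards each operation to the provider configured for it.
class SimpleCompositeVoice implements SpeechToText, TextToSpeech {
  private readonly input: SpeechToText
  private readonly output: TextToSpeech

  constructor(providers: { input: SpeechToText; output: TextToSpeech }) {
    this.input = providers.input
    this.output = providers.output
  }

  listen(audio: NodeJS.ReadableStream): Promise<string> {
    return this.input.listen(audio)
  }

  speak(text: string): Promise<NodeJS.ReadableStream> {
    return this.output.speak(text)
  }
}

// Stand-in providers so the sketch runs without API keys.
const composite = new SimpleCompositeVoice({
  input: { listen: async () => 'transcribed by input provider' },
  output: { speak: async text => Readable.from([Buffer.from(text)]) },
})
```

Keeping the two roles behind separate interfaces is what lets the speaking and listening providers be swapped independently.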
### Using AI SDK
Mastra supports using AI SDK's transcription and speech models directly in `CompositeVoice`, giving you access to a wide range of providers through the AI SDK ecosystem:
```typescript
import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { openai } from '@ai-sdk/openai'
import { elevenlabs } from '@ai-sdk/elevenlabs'
export const agent = new Agent({
  id: 'aisdk-voice-agent',
  name: 'AI SDK Voice Agent',
  instructions: `You are a helpful assistant with voice capabilities.`,
  model: 'openai/gpt-5.4',
  // Pass AI SDK models directly to CompositeVoice
  voice: new CompositeVoice({
    input: openai.transcription('whisper-1'), // AI SDK transcription model
    output: elevenlabs.speech('eleven_turbo_v2'), // AI SDK speech model
  }),
})
// Use voice capabilities as usual
const audioStream = await agent.voice.speak('Hello!')
const transcribedText = await agent.voice.listen(audioStream)
```
#### Mix and Match Providers
You can mix AI SDK models with Mastra voice providers:
```typescript
import { CompositeVoice } from '@mastra/core/voice'
import { PlayAIVoice } from '@mastra/voice-playai'
import { openai } from '@ai-sdk/openai'
// Use AI SDK for transcription and Mastra provider for speech
const voice = new CompositeVoice({
  input: openai.transcription('whisper-1'), // AI SDK
  output: new PlayAIVoice(), // Mastra provider
})
```
For the complete list of supported AI SDK providers and their capabilities:
- [Transcription](https://ai-sdk.dev/docs/providers/openai/transcription)
- [Speech](https://ai-sdk.dev/docs/providers/elevenlabs/speech)
## Supported voice providers
Mastra supports multiple voice providers for text-to-speech (TTS) and speech-to-text (STT) capabilities:
| Provider        | Package                         | Features                  | Reference                                                          |
| --------------- | ------------------------------- | ------------------------- | ------------------------------------------------------------------ |
| OpenAI          | `@mastra/voice-openai`          | TTS, STT                  | [Documentation](https://mastra.ai/reference/voice/openai)          |
| OpenAI Realtime | `@mastra/voice-openai-realtime` | Realtime speech-to-speech | [Documentation](https://mastra.ai/reference/voice/openai-realtime) |
| ElevenLabs      | `@mastra/voice-elevenlabs`      | High-quality TTS          | [Documentation](https://mastra.ai/reference/voice/elevenlabs)      |
| PlayAI          | `@mastra/voice-playai`          | TTS                       | [Documentation](https://mastra.ai/reference/voice/playai)          |
| Google          | `@mastra/voice-google`          | TTS, STT                  | [Documentation](https://mastra.ai/reference/voice/google)          |
| Deepgram        | `@mastra/voice-deepgram`        | STT                       | [Documentation](https://mastra.ai/reference/voice/deepgram)        |
| Murf            | `@mastra/voice-murf`            | TTS                       | [Documentation](https://mastra.ai/reference/voice/murf)            |
| Speechify       | `@mastra/voice-speechify`       | TTS                       | [Documentation](https://mastra.ai/reference/voice/speechify)       |
| Sarvam          | `@mastra/voice-sarvam`          | TTS, STT                  | [Documentation](https://mastra.ai/reference/voice/sarvam)          |
| Azure           | `@mastra/voice-azure`           | TTS, STT                  | [Documentation](https://mastra.ai/reference/voice/mastra-voice)    |
| Cloudflare      | `@mastra/voice-cloudflare`      | TTS                       | [Documentation](https://mastra.ai/reference/voice/mastra-voice)    |
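
When choosing providers for a `CompositeVoice`, the Features column indicates which ones can fill the `input` (STT) or `output` (TTS) role. As a rough sketch, the table can be expressed as a lookup; the `Capability` type and `providersWith` helper are hypothetical, and the data is simply transcribed from the table above.

```typescript
type Capability = 'tts' | 'stt' | 'realtime'

// Capabilities transcribed from the Features column of the table above.
const providerCapabilities: Record<string, Capability[]> = {
  OpenAI: ['tts', 'stt'],
  'OpenAI Realtime': ['realtime'],
  ElevenLabs: ['tts'],
  PlayAI: ['tts'],
  Google: ['tts', 'stt'],
  Deepgram: ['stt'],
  Murf: ['tts'],
  Speechify: ['tts'],
  Sarvam: ['tts', 'stt'],
  Azure: ['tts', 'stt'],
  Cloudflare: ['tts'],
}

// Hypothetical helper: lists the providers that advertise a capability,
// in the same order as the table.
function providersWith(capability: Capability): string[] {
  return Object.entries(providerCapabilities)
    .filter(([, caps]) => caps.includes(capability))
    .map(([name]) => name)
}

console.log(providersWith('stt')) // candidates for the `input` role
```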
## Next steps
- [Voice API Reference](https://mastra.ai/reference/voice/mastra-voice) - Detailed API documentation for voice capabilities
- [Text to Speech Examples](https://github.com/mastra-ai/voice-examples/tree/main/text-to-speech) - Interactive story generator and other TTS implementations
- [Speech to Text Examples](https://github.com/mastra-ai/voice-examples/tree/main/speech-to-text) - Voice memo app and other STT implementations
- [Speech to Speech Examples](https://github.com/mastra-ai/voice-examples/tree/main/speech-to-speech) - Real-time voice conversation with call analysis