Voice & TTS Configuration
Configure Text-to-Speech providers, voices, and audio settings for your EWT Voice Agent.
1. Overview
Text-to-Speech (TTS) is the final stage of the voice pipeline. Every time your agent responds, the flow is:
- STT — Caller audio is transcribed to text (Deepgram Nova-2).
- LLM — The transcript is sent to the language model, which generates a reply.
- TTS — The reply text is converted into audio and streamed back to the caller.
TTS quality directly affects how callers perceive your agent. A natural-sounding voice builds trust and keeps callers engaged; a robotic or glitchy voice drives hang-ups. Latency matters too — the TTS provider adds 200–500 ms to each response cycle, so choosing the right model is a balance between quality, speed, and cost.
EWT Voice Agent supports two TTS providers:
- ElevenLabs (default) — Premium neural voices with streaming WebSocket delivery.
- Deepgram Aura — Lower-cost voices with REST-based synthesis.
You set the provider per-agent using the tts_provider field. The default is elevenlabs.
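As a sketch, switching an agent to Deepgram is a one-field change, plus a Deepgram voice (voice IDs are covered in section 4):

```json
{
  "tts_provider": "deepgram",
  "voice_id": "aura-asteria-en"
}
```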
2. ElevenLabs
ElevenLabs is the default TTS provider. It connects via a persistent WebSocket to stream audio chunks in real time as the LLM generates text, enabling the lowest possible time-to-first-byte.
Available Models
| Model ID | Description | Latency | Cost | Best For |
|---|---|---|---|---|
| eleven_flash_v2_5 | Fastest model, optimized for low latency | ~200 ms | Lowest | Real-time phone calls (recommended default) |
| eleven_turbo_v2_5 | Balanced speed and quality | ~350 ms | Medium | When you need slightly richer voice quality without sacrificing too much speed |
| eleven_multilingual_v2 | Full multilingual support with highest quality | ~500 ms | Highest | Non-English calls or when voice quality is the top priority over latency |
For real-time phone calls, eleven_flash_v2_5 is the best choice, and it is the default when no model is specified. Switch to eleven_multilingual_v2 only if you need non-English voice support or premium quality for a demo.
How Streaming Works
The ElevenLabs provider opens a WebSocket connection to:
```
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}&output_format=ulaw_8000
```
Audio is delivered as base64-encoded mu-law 8kHz chunks — the native format for telephony — so no transcoding is needed. The connection sends a BOS (beginning-of-stream) message with voice settings on open, streams text fragments via sendText(), and sends an empty text string to signal EOS (end-of-stream) via flush().
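The message sequence above can be sketched as follows. The payload shapes mirror the public ElevenLabs stream-input protocol, but treat the exact field names as assumptions to verify against your API version:

```typescript
interface VoiceSettings {
  stability: number;
  similarity_boost: number;
  speed: number;
}

// BOS: sent once when the socket opens; a single-space text primes the stream
// and carries the voice settings for the whole connection.
function bosMessage(settings: VoiceSettings): string {
  return JSON.stringify({ text: " ", voice_settings: settings });
}

// Text fragment: sent for each LLM chunk via sendText().
function textMessage(fragment: string): string {
  return JSON.stringify({ text: fragment });
}

// EOS: an empty text string tells the server to flush any remaining audio.
function eosMessage(): string {
  return JSON.stringify({ text: "" });
}

// The stream-input URL shown above, parameterized by voice and model.
function streamUrl(voiceId: string, model: string): string {
  return (
    `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}` +
    `/stream-input?model_id=${model}&output_format=ulaw_8000`
  );
}
```

The real provider wires these messages onto a WebSocket; the functions only illustrate the framing.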
Pricing Reference
ElevenLabs charges per character. At roughly 300 characters per minute of speech, the platform estimates TTS cost at approximately $0.025/min for a typical call. This is reflected in the dashboard cost estimator.
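Using the figures above (~300 characters per minute of speech at ~$0.025/min), a back-of-the-envelope estimator looks like this:

```typescript
// Rough TTS cost estimate from total characters synthesized.
// Constants are the platform's published estimates, not billing rates.
const CHARS_PER_MIN = 300;
const COST_PER_MIN = 0.025; // USD

function estimateTtsCost(totalChars: number): number {
  const minutes = totalChars / CHARS_PER_MIN;
  return minutes * COST_PER_MIN;
}

// A 5-minute call (~1500 characters of agent speech) costs roughly $0.125.
```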
3. ElevenLabs Voice Settings
voice_id
The voice_id determines which voice your agent uses. If not set, the default is pNInz6obpgDQGcFmaJgB (Adam). The dashboard groups voices into categories: Best for AI Callers, Professional, Cloned, Generated, and All Premade.
The following voices are curated as "Best for AI Callers" in the dashboard:
| Voice Name | Voice ID | Description |
|---|---|---|
| Adam | pNInz6obpgDQGcFmaJgB | Male, deep and warm (default) |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female, calm and professional |
| Antoni | ErXwobaYiN019PkySvjV | Male, friendly and conversational |
| Bella | EXAVITQu4vr4xnSDxMaL | Female, soft and warm |
| Elli | MF3mGyEYCl7XYWbV9V6O | Female, young and energetic |
| Josh | TxGEqnHWrfWFTfGW9XjX | Male, deep and authoritative |
| Sam | yoZ06aMxZJJ28mfd3POQ | Male, clear and neutral |
| James | ZQe5CZNOzWyzPSCn5a3c | Male, authoritative |
| Gigi | jBpfuIE2acCO8z3wKNLl | Female, animated and bright |
| Daniel | onwK4e9ZLuTAKqWW03F9 | Male, British accent |
You can also use any voice from your ElevenLabs account, including cloned voices and voices generated with their Voice Design tool.
stability (0–1)
Controls how consistent the voice sounds across generations. Set via the voice_stability field on the agent (mapped to stability in the ElevenLabs API).
| Value | Behavior |
|---|---|
| 0.0 – 0.3 | More expressive and variable. The voice may sound more emotional but less predictable. Good for storytelling or casual tones. |
| 0.3 – 0.6 | Balanced. Natural variation without being erratic. Recommended for most agents. |
| 0.6 – 1.0 | Very stable and monotone. Sounds more robotic but extremely consistent. Use for IVR menus or reading data. |
Default: 0.4
similarity_boost (0–1)
Controls how closely the output matches the original voice sample. Set via the voice_similarity field.
| Value | Behavior |
|---|---|
| 0.0 – 0.3 | Loose match. More generic but fewer artifacts. Good if the voice has quality issues. |
| 0.3 – 0.7 | Balanced. Sounds like the target voice without amplifying recording imperfections. |
| 0.7 – 1.0 | Very close match. Can amplify noise or artifacts from the source recording. Best for high-quality cloned voices. |
Default: 0.5
speed (0.5–2.0)
Controls playback speed of the generated audio. Set via the voice_speed field.
| Value | Use Case |
|---|---|
| 0.5 – 0.8 | Slower speech. Good for elderly callers or complex information. |
| 0.9 – 1.1 | Natural pace. Recommended for most agents. |
| 1.2 – 1.5 | Faster speech. Good for concise transactional calls. |
| 1.5 – 2.0 | Very fast. May introduce audio artifacts — test carefully before deploying. |
Default: 1.0
Recommended Settings by Use Case
| Use Case | Stability | Similarity | Speed |
|---|---|---|---|
| Professional customer service | 0.4 | 0.5 | 1.0 |
| Warm sales / follow-up | 0.3 | 0.5 | 1.05 |
| IVR / menu navigation | 0.7 | 0.5 | 0.95 |
| Casual / conversational | 0.25 | 0.4 | 1.0 |
| Reading data (numbers, addresses) | 0.6 | 0.5 | 0.9 |
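Applied to an agent, the first row of the table above maps to these fields (the voice ID shown is Rachel, from the voice table earlier in this section):

```json
{
  "tts_provider": "elevenlabs",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "voice_stability": 0.4,
  "voice_similarity": 0.5,
  "voice_speed": 1.0
}
```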
4. Deepgram Aura
Deepgram Aura is an alternative TTS provider that uses a REST API instead of WebSockets. It accumulates text via sendText() and synthesizes the full buffer when flush() is called. Audio is returned as a streaming response in mu-law 8kHz format — no transcoding needed for telephony.
To use Deepgram, set tts_provider to "deepgram" on your agent. The voice is selected via voice_id, which maps to the Deepgram model name (e.g., aura-asteria-en).
Available Voices
| Voice ID | Name | Gender | Accent |
|---|---|---|---|
| aura-asteria-en | Asteria | Female | US |
| aura-luna-en | Luna | Female | US |
| aura-stella-en | Stella | Female | US |
| aura-athena-en | Athena | Female | UK |
| aura-hera-en | Hera | Female | US |
| aura-orion-en | Orion | Male | US |
| aura-arcas-en | Arcas | Male | US |
| aura-perseus-en | Perseus | Male | US |
| aura-angus-en | Angus | Male | Ireland |
| aura-orpheus-en | Orpheus | Male | US |
| aura-helios-en | Helios | Male | UK |
| aura-zeus-en | Zeus | Male | US |
The default voice when using Deepgram is aura-asteria-en (Asteria).
Deepgram does not support the voice_stability, voice_similarity, or voice_speed settings. Those fields are ElevenLabs-specific and are ignored when tts_provider is "deepgram".
When to Choose Deepgram
- Cost-sensitive deployments — Deepgram TTS is significantly cheaper than ElevenLabs, making it a good fit for high-volume use cases.
- Simpler architecture — REST-based synthesis means no WebSocket management or reconnection logic.
- English-only calls — All Aura voices are English. If you only serve English-speaking callers, Deepgram covers your needs.
5. Choosing a Provider
| Factor | ElevenLabs | Deepgram Aura |
|---|---|---|
| Latency | ~200–500 ms (WebSocket streaming, incremental) | ~300–600 ms (REST, waits for full text on flush) |
| Cost per minute | ~$0.025/min | Lower (pay-per-character, generally cheaper) |
| Voice quality | Premium neural voices, highly natural | Good quality, slightly less natural |
| Voice selection | Large library + custom cloned voices | 12 built-in voices |
| Language support | 29+ languages (multilingual_v2 model) | English only |
| Voice tuning | Stability, similarity, speed controls | No tuning parameters |
| Connection type | Persistent WebSocket | REST API per flush |
| Barge-in handling | Close and reconnect WebSocket to cancel | AbortController aborts the HTTP request |
| Voicemail audio | Supported (REST API pre-generation) | Not supported — voicemail always uses ElevenLabs |
You can set tts_provider to "deepgram" for your agent's live calls while still using ElevenLabs for voicemail pre-generation. Voicemail audio is always generated via the ElevenLabs REST API regardless of the live TTS provider.
6. Pronunciations
The pronunciations field is a JSONB array stored on each agent. It lets you define custom pronunciations for words that TTS engines commonly mispronounce — brand names, acronyms, technical terms, or proper nouns.
Format
Each entry is an object with word and pronunciation keys:
```json
{
  "pronunciations": [
    { "word": "EWT", "pronunciation": "E W T" },
    { "word": "HVAC", "pronunciation": "H-vack" },
    { "word": "Acme", "pronunciation": "Ak-mee" },
    { "word": "MySQL", "pronunciation": "My S Q L" },
    { "word": "GIF", "pronunciation": "Jif" }
  ]
}
```
Dashboard Entry
In the agent editor, the pronunciation dictionary is a text area where you enter one mapping per line using the format word=pronunciation:
```
EWT=E W T
HVAC=H-vack
Acme=Ak-mee
```
The dashboard converts this to the JSON array format when saving.
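A hypothetical sketch of that conversion (the dashboard's actual implementation may differ):

```typescript
interface Pronunciation {
  word: string;
  pronunciation: string;
}

// Convert "word=pronunciation" lines (one per line) into the JSONB
// array format stored on the agent. Blank and malformed lines are skipped.
function parsePronunciations(text: string): Pronunciation[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.includes("="))
    .map((line) => {
      const [word, ...rest] = line.split("=");
      // rejoin in case the pronunciation itself contains "="
      return { word: word.trim(), pronunciation: rest.join("=").trim() };
    });
}
```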
For example, the mapping API=A P I ensures the engine says each letter instead of trying to pronounce it as a word.
7. Voicemail Messages
When enable_voicemail_detection is true (the default), the platform can detect answering machines. The voicemail_message field contains the text that your agent will leave as a voicemail.
How It Works
- When you create or update an agent with both voicemail_message and voice_id set, the platform fires a background job to pre-generate the voicemail audio.
- The audio is generated via the ElevenLabs REST API (not WebSocket), using the same voice and model as the agent.
- Output format is MP3 at 44.1kHz / 128kbps, saved to voicemail-audio/{agent_id}.mp3.
- The voicemail audio is regenerated automatically if you change the voicemail_message, voice_id, or elevenlabs_model.
- If you clear the voicemail message, the MP3 file is deleted.
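The regeneration rule above can be sketched as a pure check. shouldRegenerateVoicemail and VoicemailInputs are illustrative names, not platform APIs:

```typescript
interface VoicemailInputs {
  voicemail_message: string | null;
  voice_id: string | null;
  elevenlabs_model: string;
}

// Voicemail audio is (re)generated only when both required fields are set
// and at least one input that affects the audio has changed.
function shouldRegenerateVoicemail(
  prev: VoicemailInputs,
  next: VoicemailInputs,
): boolean {
  if (!next.voicemail_message || !next.voice_id) return false;
  return (
    prev.voicemail_message !== next.voicemail_message ||
    prev.voice_id !== next.voice_id ||
    prev.elevenlabs_model !== next.elevenlabs_model
  );
}
```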
REST API Endpoint Used
```
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?output_format=mp3_44100_128
```

```json
{
  "text": "Hi, this is Sarah from Acme Corp. Sorry I missed your call...",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": {
    "stability": 0.4,
    "similarity_boost": 0.5,
    "speed": 1.0
  }
}
```
8. Troubleshooting
Voice sounds robotic or flat
- Your voice_stability is likely too high. Lower it from the default 0.4 toward 0.25–0.35 for a more expressive voice.
- Check voice_similarity — if it is very low (below 0.3), the voice may sound generic. Increase it to 0.5.
- Try a different voice. Some ElevenLabs voices naturally sound more animated than others. Rachel and Antoni tend to sound conversational out of the box.
Audio glitches or clipping at high speed
- Speeds above 1.5x frequently produce artifacts, especially on longer sentences. Reduce voice_speed to 1.2 or lower.
- If you need fast speech, consider keeping speed at 1.0 and instead instructing the LLM (via system prompt) to generate shorter, punchier responses.
WebSocket drops after ~15 seconds of silence
- The ElevenLabs WebSocket will close if no text is sent for roughly 15 seconds. This can happen during long tool calls or when the caller is providing extended input.
- The platform handles this by reconnecting on the next sendText() call, but it adds latency to that first response after the gap.
- Mitigation: keep your idle_timeout_seconds (default 7.5s) below 15 seconds so the agent sends an idle prompt before the WebSocket times out.
ElevenLabs WebSocket connection timeout
- The provider has a 5-second connection timeout. If the connection fails, check that your ElevenLabs API key is valid and that the voice_id exists in your account.
- Verify your tenant settings have elevenlabs_api_key configured and the key has available character quota.
Deepgram TTS returns errors
- Verify the voice_id matches one of the supported Aura model names exactly (e.g., aura-asteria-en).
- Check that your Deepgram API key is set in tenant settings and has TTS permissions enabled.
Voicemail audio not generating
- Both voicemail_message and voice_id must be set on the agent.
- The tenant must have a valid elevenlabs_api_key in settings (voicemail always uses ElevenLabs).
- Check server logs for errors from the agents logger — generation failures are logged but do not block the agent save.
Full Example: Professional Customer Service Agent
The following JSON creates an agent optimized for professional customer service calls with tuned voice settings:
```json
{
  "name": "Support Agent - Sarah",
  "system_prompt": "You are Sarah, a professional and empathetic customer support agent for Acme Corp. Keep responses to 1-2 sentences. Listen carefully, ask clarifying questions, and provide clear solutions. If you cannot resolve the issue, offer to transfer to a specialist.",
  "first_message": "Hi, thanks for calling Acme support! How can I help you today?",
  "tone": "professional",
  "model": "openai/gpt-4o",
  "language": "en",
  "tts_provider": "elevenlabs",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "elevenlabs_model": "eleven_flash_v2_5",
  "voice_stability": 0.4,
  "voice_similarity": 0.5,
  "voice_speed": 1.0,
  "max_call_duration": 600,
  "endpointing": 250,
  "max_tokens": 200,
  "temperature": 0.7,
  "enable_voicemail_detection": true,
  "voicemail_message": "Hi, this is Sarah from Acme support. Sorry I missed your call. Please leave a message and I will have someone get back to you within the hour. Thanks!",
  "pronunciations": [
    { "word": "Acme", "pronunciation": "Ak-mee" },
    { "word": "SLA", "pronunciation": "S L A" }
  ],
  "end_call_phrases": ["goodbye", "bye", "that's all", "thanks bye"],
  "enable_call_analysis": true,
  "success_eval_prompt": "Was the customer's issue resolved? Did they seem satisfied?",
  "idle_timeout_seconds": 7.5,
  "idle_max_triggers": 3,
  "first_message_mode": "assistant-speaks-first"
}
```
Send this as a POST request to /api/agents with your authentication token to create the agent.