Voice & TTS Configuration

Configure Text-to-Speech providers, voices, and audio settings for your EWT Voice Agent.

1. Overview

Text-to-Speech (TTS) is the final stage of the voice pipeline. Every time your agent responds, the flow is:

  1. STT — Caller audio is transcribed to text (Deepgram Nova-2).
  2. LLM — The transcript is sent to the language model, which generates a reply.
  3. TTS — The reply text is converted into audio and streamed back to the caller.

TTS quality directly affects how callers perceive your agent. A natural-sounding voice builds trust and keeps callers engaged; a robotic or glitchy voice drives hang-ups. Latency matters too — the TTS provider adds 200–500 ms to each response cycle, so choosing the right model is a balance between quality, speed, and cost.

EWT Voice Agent supports two TTS providers:

  1. ElevenLabs (the default): premium streaming voices over a persistent WebSocket, with a large voice library and tuning controls.
  2. Deepgram Aura: a REST-based alternative with 12 built-in English voices, generally cheaper per character.

You set the provider per-agent using the tts_provider field. The default is elevenlabs.
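As an illustration, an agent payload that pins the provider explicitly might include the following fields (the voice_id shown is the documented default, Adam):

```json
{
  "tts_provider": "elevenlabs",
  "voice_id": "pNInz6obpgDQGcFmaJgB"
}
```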

2. ElevenLabs

ElevenLabs is the default TTS provider. It connects via a persistent WebSocket to stream audio chunks in real time as the LLM generates text, enabling the lowest possible time-to-first-byte.

Available Models

| Model ID | Description | Latency | Cost | Best For |
|---|---|---|---|---|
| eleven_flash_v2_5 | Fastest model, optimized for low latency | ~200 ms | Lowest | Real-time phone calls (recommended default) |
| eleven_turbo_v2_5 | Balanced speed and quality | ~350 ms | Medium | When you need slightly richer voice quality without sacrificing too much speed |
| eleven_multilingual_v2 | Full multilingual support with highest quality | ~500 ms | Highest | Non-English calls or when voice quality is the top priority over latency |

For most phone-based agents, eleven_flash_v2_5 is the best choice. It is the default when no model is specified. Only switch to eleven_multilingual_v2 if you need non-English voice support or premium quality for a demo.

How Streaming Works

The ElevenLabs provider opens a WebSocket connection to:

wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}&output_format=ulaw_8000

Audio is delivered as base64-encoded mu-law 8kHz chunks — the native format for telephony — so no transcoding is needed. The connection sends a BOS (beginning-of-stream) message with voice settings on open, streams text fragments via sendText(), and sends an empty text string to signal EOS (end-of-stream) via flush().
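In TypeScript terms, the three message shapes on that connection look roughly like this. The field names follow ElevenLabs' public stream-input protocol; the helper functions themselves are illustrative, not the platform's actual code:

```typescript
// Sketch of the stream-input message framing described above.
interface VoiceSettings {
  stability: number;
  similarity_boost: number;
  speed?: number;
}

// BOS: opens the stream with voice settings. By convention the
// opening message carries a single space as its text.
function bosMessage(settings: VoiceSettings, apiKey: string): string {
  return JSON.stringify({ text: " ", voice_settings: settings, xi_api_key: apiKey });
}

// Text fragment: what sendText() would put on the wire.
function textMessage(fragment: string): string {
  return JSON.stringify({ text: fragment });
}

// EOS: an empty text string tells the server to flush remaining
// audio and close the stream (what flush() signals).
function eosMessage(): string {
  return JSON.stringify({ text: "" });
}
```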

Pricing Reference

ElevenLabs charges per character. At roughly 300 characters per minute of speech, the platform estimates TTS cost at approximately $0.025/min for a typical call. This is reflected in the dashboard cost estimator.
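Those figures make per-call estimates a one-liner. A sketch using the ~300 chars/min and ~$0.025/min numbers above (the function name is ours):

```typescript
// Back-of-the-envelope TTS cost estimate from the doc's figures:
// ~300 characters of reply text per spoken minute, ~$0.025 per minute.
const CHARS_PER_MINUTE = 300;
const COST_PER_MINUTE_USD = 0.025;

function estimateTtsCostUsd(replyCharacters: number): number {
  const minutes = replyCharacters / CHARS_PER_MINUTE;
  return minutes * COST_PER_MINUTE_USD;
}
```

For example, a call whose agent replies total 1,500 characters works out to about 5 minutes of speech, or roughly $0.125 of TTS cost.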

3. ElevenLabs Voice Settings

voice_id

The voice_id determines which voice your agent uses. If not set, the default is pNInz6obpgDQGcFmaJgB (Adam). The dashboard groups voices into categories: Best for AI Callers, Professional, Cloned, Generated, and All Premade.

The following voices are curated as "Best for AI Callers" in the dashboard:

| Voice Name | Voice ID | Description |
|---|---|---|
| Adam | pNInz6obpgDQGcFmaJgB | Male, deep and warm (default) |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female, calm and professional |
| Antoni | ErXwobaYiN019PkySvjV | Male, friendly and conversational |
| Bella | EXAVITQu4vr4xnSDxMaL | Female, soft and warm |
| Elli | MF3mGyEYCl7XYWbV9V6O | Female, young and energetic |
| Josh | TxGEqnHWrfWFTfGW9XjX | Male, deep and authoritative |
| Sam | yoZ06aMxZJJ28mfd3POQ | Male, clear and neutral |
| James | ZQe5CZNOzWyzPSCn5a3c | Male, authoritative |
| Gigi | jBpfuIE2acCO8z3wKNLl | Female, animated and bright |
| Daniel | onwK4e9ZLuTAKqWW03F9 | Male, British accent |

You can also use any voice from your ElevenLabs account, including cloned voices and voices generated with their Voice Design tool.

stability (0–1)

Controls how consistent the voice sounds across generations. Set via the voice_stability field on the agent (mapped to stability in the ElevenLabs API).

| Value | Behavior |
|---|---|
| 0.0 – 0.3 | More expressive and variable. The voice may sound more emotional but less predictable. Good for storytelling or casual tones. |
| 0.3 – 0.6 | Balanced. Natural variation without being erratic. Recommended for most agents. |
| 0.6 – 1.0 | Very stable and monotone. Sounds more robotic but extremely consistent. Use for IVR menus or reading data. |

Default: 0.4

similarity_boost (0–1)

Controls how closely the output matches the original voice sample. Set via the voice_similarity field.

| Value | Behavior |
|---|---|
| 0.0 – 0.3 | Loose match. More generic but fewer artifacts. Good if the voice has quality issues. |
| 0.3 – 0.7 | Balanced. Sounds like the target voice without amplifying recording imperfections. |
| 0.7 – 1.0 | Very close match. Can amplify noise or artifacts from the source recording. Best for high-quality cloned voices. |

Default: 0.5

speed (0.5–2.0)

Controls playback speed of the generated audio. Set via the voice_speed field.

| Value | Use Case |
|---|---|
| 0.5 – 0.8 | Slower speech. Good for elderly callers or complex information. |
| 0.9 – 1.1 | Natural pace. Recommended for most agents. |
| 1.2 – 1.5 | Faster speech. Good for concise transactional calls. |
| 1.5 – 2.0 | Very fast. May introduce audio artifacts. Test carefully before deploying. |

Default: 1.0

Speeds above 1.5x can cause audible glitching and clipping artifacts, especially with longer sentences. Always test at your target speed with real call recordings before going live.

Recommended Settings by Use Case

| Use Case | Stability | Similarity | Speed |
|---|---|---|---|
| Professional customer service | 0.4 | 0.5 | 1.0 |
| Warm sales / follow-up | 0.3 | 0.5 | 1.05 |
| IVR / menu navigation | 0.7 | 0.5 | 0.95 |
| Casual / conversational | 0.25 | 0.4 | 1.0 |
| Reading data (numbers, addresses) | 0.6 | 0.5 | 0.9 |

4. Deepgram Aura

Deepgram Aura is an alternative TTS provider that uses a REST API instead of WebSockets. It accumulates text via sendText() and synthesizes the full buffer when flush() is called. Audio is returned as a streaming response in mu-law 8kHz format — no transcoding needed for telephony.

To use Deepgram, set tts_provider to "deepgram" on your agent. The voice is selected via voice_id, which maps to the Deepgram model name (e.g., aura-asteria-en).
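Put together, a minimal agent update payload for Deepgram might look like this (field names as documented; sketch only):

```json
{
  "tts_provider": "deepgram",
  "voice_id": "aura-asteria-en"
}
```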

Available Voices

| Voice ID | Name | Gender | Accent |
|---|---|---|---|
| aura-asteria-en | Asteria | Female | US |
| aura-luna-en | Luna | Female | US |
| aura-stella-en | Stella | Female | US |
| aura-athena-en | Athena | Female | UK |
| aura-hera-en | Hera | Female | US |
| aura-orion-en | Orion | Male | US |
| aura-arcas-en | Arcas | Male | US |
| aura-perseus-en | Perseus | Male | US |
| aura-angus-en | Angus | Male | Ireland |
| aura-orpheus-en | Orpheus | Male | US |
| aura-helios-en | Helios | Male | UK |
| aura-zeus-en | Zeus | Male | US |

The default voice when using Deepgram is aura-asteria-en (Asteria).

Deepgram Aura does not support the voice_stability, voice_similarity, or voice_speed settings. Those fields are ElevenLabs-specific and are ignored when tts_provider is "deepgram".
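A guard like the following sketches that behavior. The field names mirror the agent schema documented here; the function and its typing are illustrative, not the platform's actual code:

```typescript
// Only ElevenLabs receives the tuning fields; Deepgram gets the model name alone.
interface AgentVoiceConfig {
  tts_provider: "elevenlabs" | "deepgram";
  voice_id: string;
  voice_stability?: number;
  voice_similarity?: number;
  voice_speed?: number;
}

interface EffectiveSettings {
  voice_id: string;
  stability?: number;
  similarity_boost?: number;
  speed?: number;
}

function effectiveVoiceSettings(agent: AgentVoiceConfig): EffectiveSettings {
  if (agent.tts_provider === "deepgram") {
    // Aura has no tuning knobs: only the model (voice_id) passes through.
    return { voice_id: agent.voice_id };
  }
  return {
    voice_id: agent.voice_id,
    stability: agent.voice_stability ?? 0.4,       // documented defaults
    similarity_boost: agent.voice_similarity ?? 0.5,
    speed: agent.voice_speed ?? 1.0,
  };
}
```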

When to Choose Deepgram

Choose Deepgram Aura when cost is the priority, your calls are English-only, and you do not need voice cloning or the stability/similarity/speed tuning controls. Stay with ElevenLabs when you need multilingual support, a larger voice library, or the most natural-sounding audio. The table below summarizes the trade-offs.

5. Choosing a Provider

| Factor | ElevenLabs | Deepgram Aura |
|---|---|---|
| Latency | ~200–500 ms (WebSocket streaming, incremental) | ~300–600 ms (REST, waits for full text on flush) |
| Cost per minute | ~$0.025/min | Lower (pay-per-character, generally cheaper) |
| Voice quality | Premium neural voices, highly natural | Good quality, slightly less natural |
| Voice selection | Large library + custom cloned voices | 12 built-in voices |
| Language support | 29+ languages (multilingual_v2 model) | English only |
| Voice tuning | Stability, similarity, speed controls | No tuning parameters |
| Connection type | Persistent WebSocket | REST API per flush |
| Barge-in handling | Close and reconnect WebSocket to cancel | AbortController aborts the HTTP request |
| Voicemail audio | Supported (REST API pre-generation) | Not supported — voicemail always uses ElevenLabs |

You can set tts_provider to "deepgram" for your agent's live calls while still using ElevenLabs for voicemail pre-generation. Voicemail audio is always generated via the ElevenLabs REST API regardless of the live TTS provider.

6. Pronunciations

The pronunciations field is a JSONB array stored on each agent. It lets you define custom pronunciations for words that TTS engines commonly mispronounce — brand names, acronyms, technical terms, or proper nouns.

Format

Each entry is an object with word and pronunciation keys:

{
  "pronunciations": [
    { "word": "EWT",   "pronunciation": "E W T" },
    { "word": "HVAC",  "pronunciation": "H-vack" },
    { "word": "Acme",  "pronunciation": "Ak-mee" },
    { "word": "MySQL", "pronunciation": "My S Q L" },
    { "word": "GIF",   "pronunciation": "Jif" }
  ]
}
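Conceptually, the substitution pass works like the sketch below. Word-boundary matching and case-sensitive behavior are our assumptions for illustration, not a spec of the platform's exact implementation:

```typescript
// Illustrative pre-TTS substitution pass over the pronunciations array.
interface Pronunciation {
  word: string;
  pronunciation: string;
}

function applyPronunciations(text: string, entries: Pronunciation[]): string {
  let out = text;
  for (const { word, pronunciation } of entries) {
    // Escape regex metacharacters, then match on word boundaries only,
    // so "HVAC" does not fire inside "HVACR". Matching is case-sensitive.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(`\\b${escaped}\\b`, "g"), pronunciation);
  }
  return out;
}
```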

Dashboard Entry

In the agent editor, the pronunciation dictionary is a text area where you enter one mapping per line using the format word=pronunciation:

EWT=E W T
HVAC=H-vack
Acme=Ak-mee

The dashboard converts this to the JSON array format when saving.
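That conversion is straightforward. A sketch of what the dashboard does on save (illustrative; splits on the first "=" only, so pronunciations may themselves contain "="):

```typescript
// Parse "word=pronunciation" lines into the JSONB array format.
function parsePronunciationLines(input: string): { word: string; pronunciation: string }[] {
  return input
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.includes("="))
    .map((line) => {
      const i = line.indexOf("="); // split on the first "=" only
      return { word: line.slice(0, i).trim(), pronunciation: line.slice(i + 1).trim() };
    });
}
```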

Use phonetic spelling or spaced-out letters for acronyms. For example, API=A P I ensures the engine says each letter instead of trying to pronounce it as a word.
Pronunciations are injected into the text before it reaches the TTS engine. If a word appears in multiple forms (e.g., "HVAC" and "hvac"), you may need entries for each casing variant depending on how your LLM generates text.

7. Voicemail Messages

When enable_voicemail_detection is true (the default), the platform can detect answering machines. The voicemail_message field contains the text that your agent will leave as a voicemail.

How It Works

  1. When you create or update an agent with both voicemail_message and voice_id set, the platform fires a background job to pre-generate the voicemail audio.
  2. The audio is generated via the ElevenLabs REST API (not WebSocket), using the same voice and model as the agent.
  3. Output format is MP3 at 44.1kHz / 128kbps, saved to voicemail-audio/{agent_id}.mp3.
  4. The voicemail audio is regenerated automatically if you change the voicemail_message, voice_id, or elevenlabs_model.
  5. If you clear the voicemail message, the MP3 file is deleted.
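The trigger condition in step 1 can be sketched as follows (an illustrative helper, not the platform's actual job code; the length check encodes this doc's ~30-second rule of thumb, not an API limit):

```typescript
// Pre-generation fires only when both voicemail_message and voice_id are set.
function shouldPregenerateVoicemail(agent: { voicemail_message?: string; voice_id?: string }): boolean {
  return Boolean(agent.voicemail_message && agent.voice_id);
}

// Soft warning for messages likely to exceed ~30 seconds of speech (~500 chars).
function voicemailTooLong(message: string): boolean {
  return message.length > 500;
}
```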

REST API Endpoint Used

POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?output_format=mp3_44100_128

{
  "text": "Hi, this is Sarah from Acme Corp. Sorry I missed your call...",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": {
    "stability": 0.4,
    "similarity_boost": 0.5,
    "speed": 1.0
  }
}
Voicemail audio is always generated with ElevenLabs, even if your agent uses Deepgram for live TTS. You must have a valid ElevenLabs API key configured in your tenant settings for voicemail generation to work.
Keep voicemail messages under 30 seconds of speech (roughly 400–500 characters). Longer messages increase generation time and may be cut off by the caller's voicemail system.

8. Troubleshooting

Voice sounds robotic or flat

Lower voice_stability: values above 0.6 are intentionally monotone, so try the 0.3–0.6 range. Keep voice_speed near 1.0. On Deepgram, try a different Aura voice, or switch to ElevenLabs for a more natural sound.

Audio glitches or clipping at high speed

Reduce voice_speed. Speeds above 1.5 are known to introduce glitching and clipping, especially on longer sentences; 0.9–1.1 is the safe range.

WebSocket drops after ~15 seconds of silence

The ElevenLabs stream-input endpoint closes connections after a short inactivity timeout. Between responses this is expected, and the connection is re-opened for the next reply. If audio cuts off mid-response, look for long LLM stalls between text fragments instead.

ElevenLabs WebSocket connection timeout

Verify that your ElevenLabs API key is valid and has remaining quota, and that the agent's voice_id and elevenlabs_model exist in your account.

Deepgram TTS returns errors

Check that your Deepgram API key is configured and that voice_id is a valid Aura model name (e.g., aura-asteria-en). Note that voice_stability, voice_similarity, and voice_speed are silently ignored with Deepgram; they do not cause errors.

Voicemail audio not generating

Pre-generation requires both voicemail_message and voice_id to be set, plus a valid ElevenLabs API key in your tenant settings (voicemail always uses ElevenLabs, even when live TTS is Deepgram). Changing any of voicemail_message, voice_id, or elevenlabs_model forces a regeneration.

Full Example: Professional Customer Service Agent

The following JSON creates an agent optimized for professional customer service calls with tuned voice settings:

{
  "name": "Support Agent - Sarah",
  "system_prompt": "You are Sarah, a professional and empathetic customer support agent for Acme Corp. Keep responses to 1-2 sentences. Listen carefully, ask clarifying questions, and provide clear solutions. If you cannot resolve the issue, offer to transfer to a specialist.",
  "first_message": "Hi, thanks for calling Acme support! How can I help you today?",
  "tone": "professional",
  "model": "openai/gpt-4o",
  "language": "en",

  "tts_provider": "elevenlabs",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "elevenlabs_model": "eleven_flash_v2_5",
  "voice_stability": 0.4,
  "voice_similarity": 0.5,
  "voice_speed": 1.0,

  "max_call_duration": 600,
  "endpointing": 250,
  "max_tokens": 200,
  "temperature": 0.7,

  "enable_voicemail_detection": true,
  "voicemail_message": "Hi, this is Sarah from Acme support. Sorry I missed your call. Please leave a message and I will have someone get back to you within the hour. Thanks!",

  "pronunciations": [
    { "word": "Acme", "pronunciation": "Ak-mee" },
    { "word": "SLA", "pronunciation": "S L A" }
  ],

  "end_call_phrases": ["goodbye", "bye", "that's all", "thanks bye"],
  "enable_call_analysis": true,
  "success_eval_prompt": "Was the customer's issue resolved? Did they seem satisfied?",
  "idle_timeout_seconds": 7.5,
  "idle_max_triggers": 3,
  "first_message_mode": "assistant-speaks-first"
}

Send this as a POST request to /api/agents with your authentication token to create the agent.