Voice & TTS Configuration

Configure Text-to-Speech providers, voices, and audio settings for your EWT Voice Agent.

1. Overview

Text-to-Speech (TTS) is the final stage of the voice pipeline. Every time your agent responds, the flow is:

  1. STT — Caller audio is transcribed to text (Deepgram Nova-2).
  2. LLM — The transcript is sent to the language model, which generates a reply.
  3. TTS — The reply text is converted into audio and streamed back to the caller.

TTS quality directly affects how callers perceive your agent. A natural-sounding voice builds trust and keeps callers engaged; a robotic or glitchy voice drives hang-ups. Latency matters too — the TTS provider adds 200–500 ms to each response cycle, so choosing the right model is a balance between quality, speed, and cost.

EWT Voice Agent supports two TTS providers:

  1. ElevenLabs (the default): premium streaming voices over a persistent WebSocket, with a large voice library and tuning controls.
  2. Deepgram Aura: a REST-based alternative with 12 built-in English voices, generally cheaper per character.

You set the provider per-agent using the tts_provider field. The default is elevenlabs.
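As an illustration, an agent payload that pins the provider explicitly might include the following fields (the voice_id shown is the documented default, Adam):

```json
{
  "tts_provider": "elevenlabs",
  "voice_id": "pNInz6obpgDQGcFmaJgB"
}
```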

2. ElevenLabs

ElevenLabs is the default TTS provider. It connects via a persistent WebSocket to stream audio chunks in real time as the LLM generates text, enabling the lowest possible time-to-first-byte.

Available Models

| Model ID | Description | Latency | Cost | Best For |
|---|---|---|---|---|
| eleven_flash_v2_5 | Fastest model, optimized for low latency | ~200 ms | Lowest | Real-time phone calls (recommended default) |
| eleven_turbo_v2_5 | Balanced speed and quality | ~350 ms | Medium | When you need slightly richer voice quality without sacrificing too much speed |
| eleven_multilingual_v2 | Full multilingual support with highest quality | ~500 ms | Highest | Non-English calls or when voice quality is the top priority over latency |

For most phone-based agents, eleven_flash_v2_5 is the best choice. It is the default when no model is specified. Only switch to eleven_multilingual_v2 if you need non-English voice support or premium quality for a demo.

How Streaming Works

The ElevenLabs provider opens a WebSocket connection to:

wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}&output_format=ulaw_8000

Audio is delivered as base64-encoded mu-law 8kHz chunks — the native format for telephony — so no transcoding is needed. The connection sends a BOS (beginning-of-stream) message with voice settings on open, streams text fragments via sendText(), and sends an empty text string to signal EOS (end-of-stream) via flush().
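In TypeScript terms, the three message shapes on that connection look roughly like this. The field names follow ElevenLabs' public stream-input protocol; the helper functions themselves are illustrative, not the platform's actual code:

```typescript
// Sketch of the stream-input message framing described above.
interface VoiceSettings {
  stability: number;
  similarity_boost: number;
  speed?: number;
}

// BOS: opens the stream with voice settings. By convention the
// opening message carries a single space as its text.
function bosMessage(settings: VoiceSettings, apiKey: string): string {
  return JSON.stringify({ text: " ", voice_settings: settings, xi_api_key: apiKey });
}

// Text fragment: what sendText() would put on the wire.
function textMessage(fragment: string): string {
  return JSON.stringify({ text: fragment });
}

// EOS: an empty text string tells the server to flush remaining
// audio and close the stream (what flush() signals).
function eosMessage(): string {
  return JSON.stringify({ text: "" });
}
```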

Pricing Reference

ElevenLabs charges per character. At roughly 300 characters per minute of speech, the platform estimates TTS cost at approximately $0.025/min for a typical call. This is reflected in the dashboard cost estimator.
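Those figures make per-call estimates a one-liner. A sketch using the ~300 chars/min and ~$0.025/min numbers above (the function name is ours):

```typescript
// Back-of-the-envelope TTS cost estimate from the doc's figures:
// ~300 characters of reply text per spoken minute, ~$0.025 per minute.
const CHARS_PER_MINUTE = 300;
const COST_PER_MINUTE_USD = 0.025;

function estimateTtsCostUsd(replyCharacters: number): number {
  const minutes = replyCharacters / CHARS_PER_MINUTE;
  return minutes * COST_PER_MINUTE_USD;
}
```

For example, a call whose agent replies total 1,500 characters works out to about 5 minutes of speech, or roughly $0.125 of TTS cost.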

3. ElevenLabs Voice Settings

voice_id

The voice_id determines which voice your agent uses. If not set, the default is pNInz6obpgDQGcFmaJgB (Adam). The dashboard groups voices into categories: Best for AI Callers, Professional, Cloned, Generated, and All Premade.

The following voices are curated as "Best for AI Callers" in the dashboard:

| Voice Name | Voice ID | Description |
|---|---|---|
| Adam | pNInz6obpgDQGcFmaJgB | Male, deep and warm (default) |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female, calm and professional |
| Antoni | ErXwobaYiN019PkySvjV | Male, friendly and conversational |
| Bella | EXAVITQu4vr4xnSDxMaL | Female, soft and warm |
| Elli | MF3mGyEYCl7XYWbV9V6O | Female, young and energetic |
| Josh | TxGEqnHWrfWFTfGW9XjX | Male, deep and authoritative |
| Sam | yoZ06aMxZJJ28mfd3POQ | Male, clear and neutral |
| James | ZQe5CZNOzWyzPSCn5a3c | Male, authoritative |
| Gigi | jBpfuIE2acCO8z3wKNLl | Female, animated and bright |
| Daniel | onwK4e9ZLuTAKqWW03F9 | Male, British accent |

You can also use any voice from your ElevenLabs account, including cloned voices and voices generated with their Voice Design tool.

stability (0–1)

Controls how consistent the voice sounds across generations. Set via the voice_stability field on the agent (mapped to stability in the ElevenLabs API).

| Value | Behavior |
|---|---|
| 0.0 – 0.3 | More expressive and variable. The voice may sound more emotional but less predictable. Good for storytelling or casual tones. |
| 0.3 – 0.6 | Balanced. Natural variation without being erratic. Recommended for most agents. |
| 0.6 – 1.0 | Very stable and monotone. Sounds more robotic but extremely consistent. Use for IVR menus or reading data. |

Default: 0.4

similarity_boost (0–1)

Controls how closely the output matches the original voice sample. Set via the voice_similarity field.

| Value | Behavior |
|---|---|
| 0.0 – 0.3 | Loose match. More generic but fewer artifacts. Good if the voice has quality issues. |
| 0.3 – 0.7 | Balanced. Sounds like the target voice without amplifying recording imperfections. |
| 0.7 – 1.0 | Very close match. Can amplify noise or artifacts from the source recording. Best for high-quality cloned voices. |

Default: 0.5

speed (0.5–2.0)

Controls playback speed of the generated audio. Set via the voice_speed field.

| Value | Use Case |
|---|---|
| 0.5 – 0.8 | Slower speech. Good for elderly callers or complex information. |
| 0.9 – 1.1 | Natural pace. Recommended for most agents. |
| 1.2 – 1.5 | Faster speech. Good for concise transactional calls. |
| 1.5 – 2.0 | Very fast. May introduce audio artifacts. Test carefully before deploying. |

Default: 1.0

Speeds above 1.5x can cause audible glitching and clipping artifacts, especially with longer sentences. Always test at your target speed with real call recordings before going live.

Recommended Settings by Use Case

| Use Case | Stability | Similarity | Speed |
|---|---|---|---|
| Professional customer service | 0.4 | 0.5 | 1.0 |
| Warm sales / follow-up | 0.3 | 0.5 | 1.05 |
| IVR / menu navigation | 0.7 | 0.5 | 0.95 |
| Casual / conversational | 0.25 | 0.4 | 1.0 |
| Reading data (numbers, addresses) | 0.6 | 0.5 | 0.9 |

4. Deepgram Aura

Deepgram Aura is an alternative TTS provider that uses a REST API instead of WebSockets. It accumulates text via sendText() and synthesizes the full buffer when flush() is called. Audio is returned as a streaming response in mu-law 8kHz format — no transcoding needed for telephony.

To use Deepgram, set tts_provider to "deepgram" on your agent. The voice is selected via voice_id, which maps to the Deepgram model name (e.g., aura-asteria-en).
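Put together, a minimal agent update payload for Deepgram might look like this (field names as documented; sketch only):

```json
{
  "tts_provider": "deepgram",
  "voice_id": "aura-asteria-en"
}
```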

Available Voices

| Voice ID | Name | Gender | Accent |
|---|---|---|---|
| aura-asteria-en | Asteria | Female | US |
| aura-luna-en | Luna | Female | US |
| aura-stella-en | Stella | Female | US |
| aura-athena-en | Athena | Female | UK |
| aura-hera-en | Hera | Female | US |
| aura-orion-en | Orion | Male | US |
| aura-arcas-en | Arcas | Male | US |
| aura-perseus-en | Perseus | Male | US |
| aura-angus-en | Angus | Male | Ireland |
| aura-orpheus-en | Orpheus | Male | US |
| aura-helios-en | Helios | Male | UK |
| aura-zeus-en | Zeus | Male | US |

The default voice when using Deepgram is aura-asteria-en (Asteria).

Deepgram Aura does not support the voice_stability, voice_similarity, or voice_speed settings. Those fields are ElevenLabs-specific and are ignored when tts_provider is "deepgram".
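A guard like the following sketches that behavior. The field names mirror the agent schema documented here; the function and its typing are illustrative, not the platform's actual code:

```typescript
// Only ElevenLabs receives the tuning fields; Deepgram gets the model name alone.
interface AgentVoiceConfig {
  tts_provider: "elevenlabs" | "deepgram";
  voice_id: string;
  voice_stability?: number;
  voice_similarity?: number;
  voice_speed?: number;
}

interface EffectiveSettings {
  voice_id: string;
  stability?: number;
  similarity_boost?: number;
  speed?: number;
}

function effectiveVoiceSettings(agent: AgentVoiceConfig): EffectiveSettings {
  if (agent.tts_provider === "deepgram") {
    // Aura has no tuning knobs: only the model (voice_id) passes through.
    return { voice_id: agent.voice_id };
  }
  return {
    voice_id: agent.voice_id,
    stability: agent.voice_stability ?? 0.4,       // documented defaults
    similarity_boost: agent.voice_similarity ?? 0.5,
    speed: agent.voice_speed ?? 1.0,
  };
}
```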

When to Choose Deepgram

Choose Deepgram Aura when cost is the priority, your calls are English-only, and you do not need voice cloning or the stability/similarity/speed tuning controls. Stay with ElevenLabs when you need multilingual support, a larger voice library, or the most natural-sounding audio. The table below summarizes the trade-offs.

5. Choosing a Provider

| Factor | ElevenLabs | Deepgram Aura |
|---|---|---|
| Latency | ~200–500 ms (WebSocket streaming, incremental) | ~300–600 ms (REST, waits for full text on flush) |
| Cost per minute | ~$0.025/min | Lower (pay-per-character, generally cheaper) |
| Voice quality | Premium neural voices, highly natural | Good quality, slightly less natural |
| Voice selection | Large library + custom cloned voices | 12 built-in voices |
| Language support | 29+ languages (multilingual_v2 model) | English only |
| Voice tuning | Stability, similarity, speed controls | No tuning parameters |
| Connection type | Persistent WebSocket | REST API per flush |
| Barge-in handling | Close and reconnect WebSocket to cancel | AbortController aborts the HTTP request |
| Voicemail audio | Supported (REST API pre-generation) | Not supported — voicemail always uses ElevenLabs |

You can set tts_provider to "deepgram" for your agent's live calls while still using ElevenLabs for voicemail pre-generation. Voicemail audio is always generated via the ElevenLabs REST API regardless of the live TTS provider.

6. Pronunciations

The pronunciations field is a JSONB array stored on each agent. It lets you define custom pronunciations for words that TTS engines commonly mispronounce — brand names, acronyms, technical terms, or proper nouns.

Format

Each entry is an object with word and pronunciation keys:

{
  "pronunciations": [
    { "word": "EWT",   "pronunciation": "E W T" },
    { "word": "HVAC",  "pronunciation": "H-vack" },
    { "word": "Acme",  "pronunciation": "Ak-mee" },
    { "word": "MySQL", "pronunciation": "My S Q L" },
    { "word": "GIF",   "pronunciation": "Jif" }
  ]
}
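Conceptually, the substitution pass works like the sketch below. Word-boundary matching and case-sensitive behavior are our assumptions for illustration, not a spec of the platform's exact implementation:

```typescript
// Illustrative pre-TTS substitution pass over the pronunciations array.
interface Pronunciation {
  word: string;
  pronunciation: string;
}

function applyPronunciations(text: string, entries: Pronunciation[]): string {
  let out = text;
  for (const { word, pronunciation } of entries) {
    // Escape regex metacharacters, then match on word boundaries only,
    // so "HVAC" does not fire inside "HVACR". Matching is case-sensitive.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(`\\b${escaped}\\b`, "g"), pronunciation);
  }
  return out;
}
```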

Dashboard Entry

In the agent editor, the pronunciation dictionary is a text area where you enter one mapping per line using the format word=pronunciation:

EWT=E W T
HVAC=H-vack
Acme=Ak-mee

The dashboard converts this to the JSON array format when saving.
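That conversion is straightforward. A sketch of what the dashboard does on save (illustrative; splits on the first "=" only, so pronunciations may themselves contain "="):

```typescript
// Parse "word=pronunciation" lines into the JSONB array format.
function parsePronunciationLines(input: string): { word: string; pronunciation: string }[] {
  return input
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.includes("="))
    .map((line) => {
      const i = line.indexOf("="); // split on the first "=" only
      return { word: line.slice(0, i).trim(), pronunciation: line.slice(i + 1).trim() };
    });
}
```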

Use phonetic spelling or spaced-out letters for acronyms. For example, API=A P I ensures the engine says each letter instead of trying to pronounce it as a word.
Pronunciations are injected into the text before it reaches the TTS engine. If a word appears in multiple forms (e.g., "HVAC" and "hvac"), you may need entries for each casing variant depending on how your LLM generates text.

7. Voicemail Messages

When enable_voicemail_detection is true (the default), the platform can detect answering machines. The voicemail_message field contains the text that your agent will leave as a voicemail.

How It Works

  1. When you create or update an agent with both voicemail_message and voice_id set, the platform fires a background job to pre-generate the voicemail audio.
  2. The audio is generated via the ElevenLabs REST API (not WebSocket), using the same voice and model as the agent.
  3. Output format is MP3 at 44.1kHz / 128kbps, saved to voicemail-audio/{agent_id}.mp3.
  4. The voicemail audio is regenerated automatically if you change the voicemail_message, voice_id, or elevenlabs_model.
  5. If you clear the voicemail message, the MP3 file is deleted.
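The trigger condition in step 1 can be sketched as follows (an illustrative helper, not the platform's actual job code; the length check encodes this doc's ~30-second rule of thumb, not an API limit):

```typescript
// Pre-generation fires only when both voicemail_message and voice_id are set.
function shouldPregenerateVoicemail(agent: { voicemail_message?: string; voice_id?: string }): boolean {
  return Boolean(agent.voicemail_message && agent.voice_id);
}

// Soft warning for messages likely to exceed ~30 seconds of speech (~500 chars).
function voicemailTooLong(message: string): boolean {
  return message.length > 500;
}
```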

REST API Endpoint Used

POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?output_format=mp3_44100_128

{
  "text": "Hi, this is Sarah from Acme Corp. Sorry I missed your call...",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": {
    "stability": 0.4,
    "similarity_boost": 0.5,
    "speed": 1.0
  }
}
Voicemail audio is always generated with ElevenLabs, even if your agent uses Deepgram for live TTS. You must have a valid ElevenLabs API key configured in your tenant settings for voicemail generation to work.
Keep voicemail messages under 30 seconds of speech (roughly 400–500 characters). Longer messages increase generation time and may be cut off by the caller's voicemail system.

8. Troubleshooting

Voice sounds robotic or flat

Lower voice_stability: values above 0.6 are intentionally monotone, so try the 0.3–0.6 range. Keep voice_speed near 1.0. On Deepgram, try a different Aura voice, or switch to ElevenLabs for a more natural sound.

Audio glitches or clipping at high speed

Reduce voice_speed. Speeds above 1.5 are known to introduce glitching and clipping, especially on longer sentences; 0.9–1.1 is the safe range.

WebSocket drops after ~15 seconds of silence

The ElevenLabs stream-input endpoint closes connections after a short inactivity timeout. Between responses this is expected, and the connection is re-opened for the next reply. If audio cuts off mid-response, look for long LLM stalls between text fragments instead.

ElevenLabs WebSocket connection timeout

Verify that your ElevenLabs API key is valid and has remaining quota, and that the agent's voice_id and elevenlabs_model exist in your account.

Deepgram TTS returns errors

Check that your Deepgram API key is configured and that voice_id is a valid Aura model name (e.g., aura-asteria-en). Note that voice_stability, voice_similarity, and voice_speed are silently ignored with Deepgram; they do not cause errors.

Voicemail audio not generating

Pre-generation requires both voicemail_message and voice_id to be set, plus a valid ElevenLabs API key in your tenant settings (voicemail always uses ElevenLabs, even when live TTS is Deepgram). Changing any of voicemail_message, voice_id, or elevenlabs_model forces a regeneration.

Full Example: Professional Customer Service Agent

The following JSON creates an agent optimized for professional customer service calls with tuned voice settings:

{
  "name": "Support Agent - Sarah",
  "system_prompt": "You are Sarah, a professional and empathetic customer support agent for Acme Corp. Keep responses to 1-2 sentences. Listen carefully, ask clarifying questions, and provide clear solutions. If you cannot resolve the issue, offer to transfer to a specialist.",
  "first_message": "Hi, thanks for calling Acme support! How can I help you today?",
  "tone": "professional",
  "model": "openai/gpt-4o",
  "language": "en",

  "tts_provider": "elevenlabs",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "elevenlabs_model": "eleven_flash_v2_5",
  "voice_stability": 0.4,
  "voice_similarity": 0.5,
  "voice_speed": 1.0,

  "max_call_duration": 600,
  "endpointing": 250,
  "max_tokens": 200,
  "temperature": 0.7,

  "enable_voicemail_detection": true,
  "voicemail_message": "Hi, this is Sarah from Acme support. Sorry I missed your call. Please leave a message and I will have someone get back to you within the hour. Thanks!",

  "pronunciations": [
    { "word": "Acme", "pronunciation": "Ak-mee" },
    { "word": "SLA", "pronunciation": "S L A" }
  ],

  "end_call_phrases": ["goodbye", "bye", "that's all", "thanks bye"],
  "enable_call_analysis": true,
  "success_eval_prompt": "Was the customer's issue resolved? Did they seem satisfied?",
  "idle_timeout_seconds": 7.5,
  "idle_max_triggers": 3,
  "first_message_mode": "assistant-speaks-first"
}

Send this as a POST request to /api/agents with your authentication token to create the agent.