STT & Language Settings
Configure Speech-to-Text, language detection, noise handling, and transcription accuracy for your EWT Voice Agent.
1. Overview — How STT Works in the Pipeline
Every voice call processed by EWT Voice Agent follows a real-time pipeline. Understanding where STT fits helps you tune it effectively.
Audio arrives from Twilio as 20ms mulaw-encoded chunks at 8000 Hz, mono channel. These chunks are forwarded in real time to a persistent Deepgram WebSocket connection. Deepgram returns both interim (partial) and final transcription results, along with Voice Activity Detection (VAD) events. The final transcript is then sent to the LLM for response generation.
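A quick sanity check on the numbers above: at 8000 Hz mono with one byte per mulaw sample, each 20ms chunk is 160 bytes, and Deepgram receives 50 chunks per second. A minimal sketch (constant and function names are illustrative, not part of the platform API):

```python
SAMPLE_RATE_HZ = 8000   # Twilio telephony sample rate
CHUNK_MS = 20           # duration of each Twilio media frame
BYTES_PER_SAMPLE = 1    # mulaw packs one sample into one byte
CHUNKS_PER_SECOND = 1000 // CHUNK_MS  # 50 frames per second

def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS) -> int:
    """Bytes in one mono mulaw chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE
```

This is why the noise gate (Section 8) and barge-in logic operate per chunk: each one carries only 20ms of audio, so per-chunk decisions stay well under human-perceptible latency.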
Key events emitted by the STT provider:
- `transcript` — contains `text`, `is_final`, `speech_final`, and `confidence`
- `utterance_end` — fires when Deepgram detects the speaker has finished an utterance
- `speech_started` — VAD event that fires the instant voice activity is detected, before any transcription is available
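A consumer of these events typically branches on the event type and the finality flags. The sketch below is illustrative, not platform code: the event shape mirrors the fields listed above, and the returned action labels are hypothetical.

```python
def handle_stt_event(event: dict, agent_speaking: bool) -> str:
    """Route a Deepgram-style STT event to the pipeline action it implies."""
    etype = event.get("type")
    if etype == "speech_started":
        # VAD fired before any text exists: barge in if the agent is mid-utterance
        return "barge_in" if agent_speaking else "listening"
    if etype == "utterance_end":
        # Hard backstop: treat whatever text we have accumulated as complete
        return "finalize_utterance"
    if etype == "transcript":
        if event.get("is_final") and event.get("speech_final"):
            return "send_to_llm"       # complete utterance, hand off to the LLM
        return "accumulate_interim"    # partial result, keep buffering
    return "ignore"
```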
2. Deepgram Models
The STT model is configured via the stt_model column on the agent. The default is nova-2.
| Model | Best For | Notes |
|---|---|---|
| `nova-2` | General use (default) | Best overall accuracy and speed. Recommended for most agents. |
| `nova-2-general` | Broad vocabulary | Slightly wider vocabulary coverage. Use when callers use diverse or uncommon terminology. |
| `nova-2-phonecall` | Telephony audio | Optimized for 8kHz phone-quality audio. Can improve accuracy on noisy phone lines. |
| `nova-2-medical` | Healthcare | Enhanced recognition of medical terminology, drug names, and clinical language. |
To change the model via the API:
PUT /api/agents/:id
Content-Type: application/json
{
"stt_model": "nova-2-phonecall"
}
nova-2 already performs well at 8kHz mulaw. Only switch to nova-2-phonecall if you notice accuracy issues on poor-quality lines.

3. Language Support
Set the transcription language with the stt_language agent column (default: en). The platform also supports mid-call language switching via the built-in switch_language tool, which disconnects and reconnects the Deepgram WebSocket with the new language code.
| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Japanese | ja |
| Spanish | es | Chinese (Mandarin) | zh |
| French | fr | Korean | ko |
| German | de | Dutch | nl |
| Portuguese | pt | Polish | pl |
| Italian | it | Russian | ru |
| Hindi | hi | Swedish | sv |
| Arabic | ar | Czech | cs |
| Turkish | tr | Romanian | ro |
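Before writing `stt_language`, it can be worth validating the code against the table above, since an unsupported code would silently degrade transcription. A minimal sketch (the helper is illustrative; the table above is the authoritative list):

```python
# Language codes from the supported-languages table
SUPPORTED_STT_LANGUAGES = {
    "en", "es", "fr", "de", "pt", "it", "hi", "ar", "tr",
    "ja", "zh", "ko", "nl", "pl", "ru", "sv", "cs", "ro",
}

def validate_language(code: str) -> str:
    """Normalize a language code and reject anything the platform doesn't list."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_STT_LANGUAGES:
        raise ValueError(f"Unsupported stt_language: {code!r}")
    return normalized
```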
Set the language when creating or updating an agent:
PUT /api/agents/:id
Content-Type: application/json
{
"stt_language": "es",
"language": "es"
}
stt_language changes only the transcription language. You should also set the agent's language field so the LLM and TTS respond in the same language. Mid-call switching via the switch_language tool updates both automatically.

4. Endpointing
Endpointing controls how long Deepgram waits after the caller stops speaking before it considers the utterance complete. It is specified in milliseconds and stored in the endpointing agent column.
The platform passes this value directly to Deepgram's endpointing parameter. A separate utterance_end_ms of 1000ms acts as a hard backstop — if no speech is detected for 1 second, Deepgram fires an utterance_end event regardless.
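The interaction between the two settings reduces to a simple rule: the effective maximum silence before an utterance is finalized is the smaller of `endpointing` and the 1000ms backstop. A sketch of that relationship (function name is illustrative):

```python
UTTERANCE_END_MS = 1000  # hard backstop passed in the Deepgram connection options

def effective_wait_ms(endpointing_ms: int) -> int:
    """Longest silence the pipeline waits before finalizing an utterance.

    Deepgram finalizes after `endpointing_ms` of silence, but the separate
    utterance_end backstop fires at 1000 ms regardless, capping the wait.
    """
    return min(endpointing_ms, UTTERANCE_END_MS)
```

This is why configuring `endpointing` above 1000 has no practical effect: the backstop fires first.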
How It Affects Responsiveness
- Lower value (100–200ms) — The agent responds faster, but may cut off callers mid-sentence, especially if they pause to think.
- Higher value (400–600ms) — The agent waits longer, allowing callers to finish complex thoughts, but feels slower.
| Scenario | Recommended Endpointing | Why |
|---|---|---|
| Fast-paced sales / lead qualification | 150–250ms | Quick back-and-forth feels natural for sales conversations |
| Customer support | 300–400ms | Gives callers time to describe problems without being cut off |
| Medical intake / legal | 400–600ms | Callers may pause to recall details; cutting off is unacceptable |
| Elderly callers / accessibility | 500–700ms | Slower speech pace requires longer pauses |
PUT /api/agents/:id
Content-Type: application/json
{
"endpointing": 400
}
CallSession applies a fallback of 400ms if no value is configured. Start at 400ms and decrease only if the agent feels sluggish.

5. Barge-in
Barge-in allows callers to interrupt the agent while it is speaking. When the system detects that the caller has started talking over the agent, it immediately stops TTS playback and audio streaming, then switches to listening mode.
How It Works
Barge-in is triggered in three ways, whichever fires first:
- VAD speech_started — Deepgram's Voice Activity Detection fires the instant it detects voice, 200–400ms faster than waiting for transcript text. This triggers an immediate barge-in while the agent is speaking.
- Interim transcript length — If an interim (non-final) transcript exceeds `bargein_threshold` characters, barge-in fires.
- Final transcript — Any final transcript received while the agent is speaking triggers barge-in.
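The three triggers combine into a single check that is only evaluated while the agent is speaking. A sketch of the decision (function and parameter names are illustrative, not platform code):

```python
def should_barge_in(agent_speaking: bool,
                    vad_speech_started: bool,
                    interim_text: str,
                    is_final: bool,
                    bargein_threshold: int = 3) -> bool:
    """True when any of the three barge-in triggers fires during agent speech."""
    if not agent_speaking:
        return False                 # nothing to interrupt
    if vad_speech_started:
        return True                  # fastest path: raw VAD event, no text yet
    if is_final:
        return True                  # any final transcript interrupts
    return len(interim_text) > bargein_threshold  # interim length trigger
```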
The bargein_threshold Setting
This is the minimum number of characters in an interim transcript needed to trigger barge-in. Default is 3. Stored as the bargein_threshold column on the agent.
| Value | Behavior | Best For |
|---|---|---|
| 1–3 | Very sensitive — any detected speech interrupts immediately | Conversational, casual agents |
| 5–10 | Moderate — requires a short phrase before interrupting | Professional support, balanced feel |
| 15+ | Less sensitive — agent finishes more of its speech before yielding | Formal/scripted calls, IVR-style flows |
PUT /api/agents/:id
Content-Type: application/json
{
"bargein_threshold": 5
}
6. Keyword Spotting
Deepgram's keyword boosting improves recognition accuracy for specific terms that the model might otherwise miss or transcribe incorrectly. This is especially useful for proper nouns, brand names, product names, and industry-specific jargon.
Keywords are stored as a stt_keywords JSONB array on the agent and passed directly to Deepgram's keywords parameter when the WebSocket connection is established.
How It Works
When you provide keywords, Deepgram boosts the probability of recognizing those terms in the audio stream. The model does not exclusively listen for these words — it biases its existing language model toward them.
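At the transport level, keyword boosting amounts to extra query parameters on the Deepgram listen WebSocket URL. The sketch below assumes Deepgram's documented `keywords` query option (repeated once per term); verify against the current Deepgram API reference before relying on the exact parameter shape.

```python
from urllib.parse import urlencode

def build_listen_url(stt_keywords: list[str],
                     base: str = "wss://api.deepgram.com/v1/listen",
                     model: str = "nova-2",
                     language: str = "en") -> str:
    """Assemble a listen URL with one repeated `keywords` parameter per term."""
    params = [("model", model), ("language", language)]
    params += [("keywords", kw) for kw in stt_keywords]
    return f"{base}?{urlencode(params)}"
```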
Use Cases
- Product names — "Salesforce", "HubSpot", "Zapier"
- Medical terms — "metformin", "lisinopril", "acetaminophen"
- Company-specific jargon — internal acronyms, project code names
- Proper nouns — uncommon personal names, city names, street addresses
PUT /api/agents/:id
Content-Type: application/json
{
"stt_keywords": [
"EWT",
"Salesforce",
"HubSpot",
"onboarding",
"HIPAA"
]
}
7. Filler Suppression
Filler suppression removes common filler words and hesitation markers from the agent's TTS output, producing cleaner and more professional-sounding speech. It is controlled by the enable_filler_suppression boolean column on the agent (default: false).
What Gets Removed
When enabled, the following filler words are stripped from LLM-generated text before it is sent to TTS:
`um`, `uh`, `uhh`, `umm`, `hmm`, `er`, `ah`, `like`, `you know`, `I mean`, `sort of`, `kind of`
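A naive implementation of this stripping is a word-boundary regex over the list above. This is a sketch, not the platform's actual filter: note that a blunt match on words like `like` will also remove legitimate uses ("I like pizza"), and leftover punctuation around removed phrases is only partially tidied here.

```python
import re

# Filler tokens from the list above; longest-first so multi-word phrases
# match before their single-word prefixes.
FILLERS = ["you know", "I mean", "sort of", "kind of",
           "umm", "uhh", "hmm", "um", "uh", "er", "ah", "like"]

_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)

def suppress_fillers(text: str) -> str:
    """Strip filler words/phrases and collapse leftover whitespace."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```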
When to Enable vs. Disable
| Enable | Disable (Default) |
|---|---|
| Professional/corporate agents | Conversational, casual agents where fillers sound more human |
| Medical, legal, or financial use cases | Agents using "deny AI" mode where human-like speech is critical |
| IVR-style scripted flows | Any agent where natural-sounding pauses improve trust |
PUT /api/agents/:id
Content-Type: application/json
{
"enable_filler_suppression": true
}
8. Noise Suppression
EWT Voice Agent includes a client-side noise gate that suppresses low-amplitude audio chunks before they reach Deepgram. This prevents background noise from generating false transcripts or triggering barge-in.
Two agent columns control this behavior:
- `enable_noise_suppression` (boolean, default `false`) — Enables the noise gate and also tells Deepgram to strip filler words from transcription results.
- `noise_gate_threshold` (integer 0–100, default `0`) — Sets the amplitude threshold. Audio chunks whose average amplitude falls below this level are silently dropped and never sent to Deepgram.
How the Noise Gate Works
Twilio sends mulaw-encoded audio where byte value 255 represents silence. The system calculates the average distance of each byte from 255 across the chunk. The threshold value (0–100) maps to an amplitude range of 0–50. If the average amplitude is below threshold / 2, the chunk is discarded.
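The gate described above can be sketched in a few lines. This mirrors the documented behavior (byte 255 as silence, average distance as amplitude, cutoff at `threshold / 2`); the function name is illustrative, not the platform's internal API.

```python
def passes_noise_gate(chunk: bytes, noise_gate_threshold: int) -> bool:
    """Return True if a mulaw chunk is loud enough to forward to Deepgram.

    Byte 255 represents silence; the chunk's amplitude is the average
    |byte - 255|, and the 0-100 threshold maps to a 0-50 amplitude cutoff.
    """
    if noise_gate_threshold <= 0 or not chunk:
        return True  # gating disabled, or nothing to measure
    avg_amplitude = sum(abs(b - 255) for b in chunk) / len(chunk)
    return avg_amplitude >= noise_gate_threshold / 2
```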
| Threshold | Effect | Best For |
|---|---|---|
| 0 | No gating (all audio passes through) | Quiet environments, high-quality connections |
| 10–20 | Filters subtle background hum, AC noise | Office environments, standard phone calls |
| 30–50 | Filters moderate ambient noise | Call centers, busy offices |
| 60–80 | Aggressive filtering, only loud speech passes | Outdoor callers, construction sites, driving |
| 90–100 | Extremely aggressive — may drop quiet speech | Not recommended for most use cases |
PUT /api/agents/:id
Content-Type: application/json
{
"enable_noise_suppression": true,
"noise_gate_threshold": 25
}
9. Smart Formatting
Smart formatting is enabled by default (smart_format: true in the Deepgram connection options). It automatically formats common patterns in transcribed text so the LLM receives clean, structured input.
What Gets Formatted
| Spoken Input | Raw Transcript | Smart Formatted |
|---|---|---|
| "five five five one two three four" | five five five one two three four | 555-1234 |
| "twenty five dollars" | twenty five dollars | $25 |
| "january fifteenth twenty twenty six" | january fifteenth twenty twenty six | January 15th, 2026 |
| "three point one four" | three point one four | 3.14 |
| "my email is john at example dot com" | my email is john at example dot com | my email is john@example.com |
Smart formatting is always on and cannot be disabled per-agent. It works alongside punctuation (punctuate: true), which adds commas, periods, and question marks to the transcript.
10. Configuration Examples
Fast-Paced Sales Call
Optimized for quick back-and-forth, rapid response, easy barge-in:
{
"stt_model": "nova-2",
"stt_language": "en",
"endpointing": 200,
"bargein_threshold": 2,
"enable_filler_suppression": false,
"enable_noise_suppression": false,
"noise_gate_threshold": 0,
"stt_keywords": ["EWT", "demo", "pricing", "onboarding"]
}
Medical Intake (Accuracy-First)
Longer endpointing so patients are never cut off, keyword boosting for drug names, filler suppression for clean agent speech:
{
"stt_model": "nova-2-medical",
"stt_language": "en",
"endpointing": 500,
"bargein_threshold": 10,
"enable_filler_suppression": true,
"enable_noise_suppression": false,
"noise_gate_threshold": 0,
"stt_keywords": [
"metformin",
"lisinopril",
"acetaminophen",
"ibuprofen",
"amoxicillin",
"HIPAA"
]
}
Noisy Environment (Outdoor / Driving Callers)
Aggressive noise gating, higher barge-in threshold to prevent false triggers from wind or road noise:
{
"stt_model": "nova-2-phonecall",
"stt_language": "en",
"endpointing": 350,
"bargein_threshold": 8,
"enable_filler_suppression": true,
"enable_noise_suppression": true,
"noise_gate_threshold": 40,
"stt_keywords": []
}
11. Troubleshooting
Agent cuts off the caller mid-sentence
Cause: Endpointing is set too low. The agent interprets brief pauses as the end of an utterance.
Fix: Increase endpointing to 400–500ms. If the issue persists, check that callers are not on poor connections where packet delays create artificial gaps.
{ "endpointing": 450 }
Agent is too slow to respond
Cause: Endpointing is set too high, causing the agent to wait too long after the caller finishes speaking.
Fix: Decrease endpointing to 200–300ms. Note that the utterance_end_ms hard limit of 1000ms means the maximum wait will never exceed one second regardless of the endpointing value.
{ "endpointing": 250 }
Background noise causes false barge-in triggers
Cause: Deepgram's VAD detects ambient noise as speech, triggering barge-in and cutting off the agent's response.
Fix: Enable noise suppression and set a threshold. Also increase bargein_threshold so brief noise spikes are not enough to trigger an interruption.
{
"enable_noise_suppression": true,
"noise_gate_threshold": 30,
"bargein_threshold": 8
}
Specific keywords are not being recognized
Cause: Deepgram's language model does not have strong priors for uncommon terms like brand names, acronyms, or technical jargon.
Fix: Add the terms to stt_keywords. Use the exact casing and spelling you expect in the transcript. Keep the list under 30 items for best results.
{
"stt_keywords": ["Zapier", "HubSpot", "HIPAA", "SOC2"]
}
Transcripts are garbled or low-confidence
Cause: Language mismatch. The agent's stt_language does not match the language the caller is actually speaking.
Fix: Verify that stt_language matches the expected caller language. For multilingual use cases, consider instructing the LLM to use the switch_language tool when it detects the caller is speaking a different language.
Agent does not respond at all after caller speaks
Cause: Noise gate threshold is set too high, silently dropping all audio before it reaches Deepgram.
Fix: Lower noise_gate_threshold or disable noise suppression entirely. Check call logs for _audioChunksReceived — if this is 0 or very low relative to call duration, the noise gate is too aggressive.
{
"enable_noise_suppression": true,
"noise_gate_threshold": 15
}