STT & Language Settings

Configure Speech-to-Text, language detection, noise handling, and transcription accuracy for your EWT Voice Agent.

1. Overview — How STT Works in the Pipeline

Every voice call processed by EWT Voice Agent follows a real-time pipeline. Understanding where STT fits helps you tune it effectively.

Caller → Twilio (mulaw 8kHz audio) → Deepgram WebSocket (STT) → Transcript → LLM → Response text → TTS → Audio back to caller

Audio arrives from Twilio as 20ms mulaw-encoded chunks at 8000 Hz, mono channel. These chunks are forwarded in real time to a persistent Deepgram WebSocket connection. Deepgram returns both interim (partial) and final transcription results, along with Voice Activity Detection (VAD) events. The final transcript is then sent to the LLM for response generation.

Key events emitted by the STT provider:

  - Interim (partial) transcripts: streamed while the caller is still speaking
  - Final transcripts: emitted once an utterance is judged complete
  - speech_started: a VAD event fired the instant voice activity is detected
  - utterance_end: fired after the utterance_end_ms silence window elapses

The connection includes a keep-alive ping every 10 seconds to prevent WebSocket timeouts during long pauses in conversation.
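The keep-alive can be sketched as a small background task. This is an illustrative sketch, not the platform's actual code: the `keepalive` helper and the websocket object's async `send` method are assumptions, though Deepgram's live API does accept a `{"type": "KeepAlive"}` text frame.

```python
import asyncio
import json

async def keepalive(ws, interval_s: float = 10.0) -> None:
    """Send a Deepgram KeepAlive frame on a fixed interval.

    Prevents the WebSocket from timing out during long silences
    in the conversation.
    """
    while True:
        await asyncio.sleep(interval_s)
        await ws.send(json.dumps({"type": "KeepAlive"}))
```

In practice this task is created when the Deepgram connection opens and cancelled when the call ends.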

2. Deepgram Models

The STT model is configured via the stt_model column on the agent. The default is nova-2.

| Model | Best For | Notes |
|---|---|---|
| nova-2 | General use (default) | Best overall accuracy and speed. Recommended for most agents. |
| nova-2-general | Broad vocabulary | Slightly wider vocabulary coverage. Use when callers use diverse or uncommon terminology. |
| nova-2-phonecall | Telephony audio | Optimized for 8kHz phone-quality audio. Can improve accuracy on noisy phone lines. |
| nova-2-medical | Healthcare | Enhanced recognition of medical terminology, drug names, and clinical language. |

To change the model via the API:

PUT /api/agents/:id
Content-Type: application/json

{
  "stt_model": "nova-2-phonecall"
}
nova-2 already performs well on 8kHz mulaw telephony audio, so phone-based agents rarely need a different model. Only switch to nova-2-phonecall if you notice accuracy issues on poor-quality lines.

3. Language Support

Set the transcription language with the stt_language agent column (default: en). The platform also supports mid-call language switching via the built-in switch_language tool, which disconnects and reconnects the Deepgram WebSocket with the new language code.

| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Japanese | ja |
| Spanish | es | Chinese (Mandarin) | zh |
| French | fr | Korean | ko |
| German | de | Dutch | nl |
| Portuguese | pt | Polish | pl |
| Italian | it | Russian | ru |
| Hindi | hi | Swedish | sv |
| Arabic | ar | Czech | cs |
| Turkish | tr | Romanian | ro |

Set the language when creating or updating an agent:

PUT /api/agents/:id
Content-Type: application/json

{
  "stt_language": "es",
  "language": "es"
}
Setting stt_language changes only the transcription language. You should also set the agent's language field so the LLM and TTS respond in the same language. Mid-call switching via the switch_language tool updates both automatically.

4. Endpointing

Endpointing controls how long Deepgram waits after the caller stops speaking before it considers the utterance complete. It is specified in milliseconds and stored in the endpointing agent column.

The platform passes this value directly to Deepgram's endpointing parameter. A separate utterance_end_ms of 1000ms acts as a hard backstop — if no speech is detected for 1 second, Deepgram fires an utterance_end event regardless.
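Putting these settings together, the connection options sent to Deepgram might be assembled like this. The option names match Deepgram's live-streaming query parameters, but the `build_stt_options` helper and the agent dict shape are illustrative assumptions:

```python
def build_stt_options(agent: dict) -> dict:
    """Assemble Deepgram live-transcription options from agent settings."""
    return {
        "model": agent.get("stt_model") or "nova-2",
        "language": agent.get("stt_language") or "en",
        "encoding": "mulaw",      # Twilio sends mulaw-encoded audio
        "sample_rate": 8000,      # 8kHz mono telephony audio
        "channels": 1,
        "interim_results": True,  # stream partial transcripts
        "smart_format": True,     # format numbers, dates, emails
        "punctuate": True,
        "vad_events": True,       # emit speech_started for barge-in
        # 400ms fallback when no endpointing is configured on the agent
        "endpointing": agent.get("endpointing") or 400,
        "utterance_end_ms": 1000,  # hard backstop after 1s of silence
    }
```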

How It Affects Responsiveness

| Scenario | Recommended Endpointing | Why |
|---|---|---|
| Fast-paced sales / lead qualification | 150–250ms | Quick back-and-forth feels natural for sales conversations |
| Customer support | 300–400ms | Gives callers time to describe problems without being cut off |
| Medical intake / legal | 400–600ms | Callers may pause to recall details; cutting off is unacceptable |
| Elderly callers / accessibility | 500–700ms | Slower speech pace requires longer pauses |
PUT /api/agents/:id
Content-Type: application/json

{
  "endpointing": 400
}
The default endpointing in the database schema is 200ms, but CallSession applies a fallback of 400ms if no value is configured. Start at 400ms and decrease only if the agent feels sluggish.

5. Barge-in

Barge-in allows callers to interrupt the agent while it is speaking. When the system detects that the caller has started talking over the agent, it immediately stops TTS playback and audio streaming, then switches to listening mode.

How It Works

Barge-in is triggered in three ways, whichever fires first:

  1. VAD speech_started — Deepgram's Voice Activity Detection fires the instant it detects voice, 200–400ms faster than waiting for transcript text. This triggers an immediate barge-in while the agent is speaking.
  2. Interim transcript length — If an interim (non-final) transcript exceeds bargein_threshold characters, barge-in fires.
  3. Final transcript — Any final transcript received while the agent is speaking triggers barge-in.
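The three triggers can be condensed into one decision function. This is an illustrative sketch; the event names follow the descriptions above, not necessarily the platform's internal naming:

```python
def should_barge_in(
    agent_speaking: bool,
    event: str,                 # "speech_started" | "interim" | "final"
    transcript: str = "",
    bargein_threshold: int = 3,
) -> bool:
    """Decide whether detected caller speech should interrupt the agent."""
    if not agent_speaking:
        return False            # nothing to interrupt
    if event == "speech_started":
        return True             # VAD fires fastest, before any text exists
    if event == "interim":
        # interim transcripts must reach the configured character count
        return len(transcript) >= bargein_threshold
    if event == "final":
        return True             # any final transcript interrupts
    return False
```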

The bargein_threshold Setting

This is the minimum number of characters in an interim transcript needed to trigger barge-in. Default is 3. Stored as the bargein_threshold column on the agent.

| Value | Behavior | Best For |
|---|---|---|
| 1–3 | Very sensitive — any detected speech interrupts immediately | Conversational, casual agents |
| 5–10 | Moderate — requires a short phrase before interrupting | Professional support, balanced feel |
| 15+ | Less sensitive — agent finishes more of its speech before yielding | Formal/scripted calls, IVR-style flows |
PUT /api/agents/:id
Content-Type: application/json

{
  "bargein_threshold": 5
}
After barge-in, the system waits 800ms for the caller to finish their sentence before sending the accumulated transcript to the LLM. This prevents sending a half-finished thought for processing.
With VAD-based barge-in active, even background noise that Deepgram interprets as speech can trigger interruptions. If callers are in noisy environments, consider enabling noise suppression (Section 8) to reduce false barge-ins.

6. Keyword Spotting

Deepgram's keyword boosting improves recognition accuracy for specific terms that the model might otherwise miss or transcribe incorrectly. This is especially useful for proper nouns, brand names, product names, and industry-specific jargon.

Keywords are stored as a stt_keywords JSONB array on the agent and passed directly to Deepgram's keywords parameter when the WebSocket connection is established.

How It Works

When you provide keywords, Deepgram boosts the probability of recognizing those terms in the audio stream. The model does not exclusively listen for these words — it biases its existing language model toward them.

Use Cases

Typical candidates are brand and product names, acronyms, and industry-specific jargon that callers mention often. For example:
PUT /api/agents/:id
Content-Type: application/json

{
  "stt_keywords": [
    "EWT",
    "Salesforce",
    "HubSpot",
    "onboarding",
    "HIPAA"
  ]
}
Adding too many keywords (50+) can degrade overall transcription accuracy. Keep the list focused on terms that are genuinely being misrecognized. Test with real calls after adding keywords.

7. Filler Suppression

Filler suppression removes common filler words and hesitation markers from the agent's TTS output, producing cleaner and more professional-sounding speech. It is controlled by the enable_filler_suppression boolean column on the agent (default: false).

What Gets Removed

When enabled, common filler words and hesitation markers, such as "um", "uh", "er", and "hmm", are stripped from LLM-generated text before it is sent to TTS.
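A minimal sketch of the stripping step. The filler list and regex here are illustrative, not the platform's exact implementation:

```python
import re

# Illustrative filler list; the platform's actual list may differ.
FILLERS = ["um", "uh", "er", "hmm", "you know"]

# Match a filler word plus an adjacent comma, case-insensitively.
_FILLER_RE = re.compile(
    r"(?:,\s*)?\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?",
    re.IGNORECASE,
)

def suppress_fillers(text: str) -> str:
    """Strip filler words from LLM output before it is sent to TTS."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```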

When to Enable vs. Disable

| Enable | Disable (Default) |
|---|---|
| Professional/corporate agents | Conversational, casual agents where fillers sound more human |
| Medical, legal, or financial use cases | Agents using "deny AI" mode where human-like speech is critical |
| IVR-style scripted flows | Any agent where natural-sounding pauses improve trust |
PUT /api/agents/:id
Content-Type: application/json

{
  "enable_filler_suppression": true
}
Filler suppression operates on the TTS output (the agent's speech), not on the caller's transcribed input. The caller's "um"s and "uh"s still appear in the transcript sent to the LLM, which is usually desirable since it preserves the caller's intent.

8. Noise Suppression

EWT Voice Agent includes a client-side noise gate that suppresses low-amplitude audio chunks before they reach Deepgram. This prevents background noise from generating false transcripts or triggering barge-in.

Two agent columns control this behavior:

  - enable_noise_suppression: boolean toggle for the client-side noise gate
  - noise_gate_threshold: gate sensitivity, from 0 (off) to 100 (most aggressive)

How the Noise Gate Works

Twilio sends mulaw-encoded audio where byte value 255 represents silence. The system calculates the average distance of each byte from 255 across the chunk. The threshold value (0–100) maps to an amplitude range of 0–50. If the average amplitude is below threshold / 2, the chunk is discarded.
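The gate described above amounts to a few lines. This sketch mirrors the calculation in the paragraph; the `passes_noise_gate` name is illustrative:

```python
def passes_noise_gate(chunk: bytes, threshold: int) -> bool:
    """Return True if a mulaw audio chunk is loud enough to forward.

    Byte value 255 represents silence in Twilio's mulaw encoding, so the
    average distance from 255 approximates amplitude. A threshold of
    0-100 maps to an amplitude cutoff of 0-50 (threshold / 2).
    """
    if threshold <= 0 or not chunk:
        return True                      # gating disabled
    avg = sum(abs(b - 255) for b in chunk) / len(chunk)
    return avg >= threshold / 2          # below the cutoff: discard chunk
```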

| Threshold | Effect | Best For |
|---|---|---|
| 0 | No gating (all audio passes through) | Quiet environments, high-quality connections |
| 10–20 | Filters subtle background hum, AC noise | Office environments, standard phone calls |
| 30–50 | Filters moderate ambient noise | Call centers, busy offices |
| 60–80 | Aggressive filtering, only loud speech passes | Outdoor callers, construction sites, driving |
| 90–100 | Extremely aggressive — may drop quiet speech | Not recommended for most use cases |
PUT /api/agents/:id
Content-Type: application/json

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 25
}
Setting the threshold too high (above 60) can cause the system to drop legitimate quiet speech, especially from soft-spoken callers or those on speakerphone. Always test with real calls before deploying aggressive thresholds to production.

9. Smart Formatting

Smart formatting is enabled by default (smart_format: true in the Deepgram connection options). It automatically formats common patterns in transcribed text so the LLM receives clean, structured input.

What Gets Formatted

| Spoken Input | Raw Transcript | Smart Formatted |
|---|---|---|
| "five five five one two three four" | five five five one two three four | 555-1234 |
| "twenty five dollars" | twenty five dollars | $25 |
| "january fifteenth twenty twenty six" | january fifteenth twenty twenty six | January 15th, 2026 |
| "three point one four" | three point one four | 3.14 |
| "my email is john at example dot com" | my email is john at example dot com | my email is john@example.com |

Smart formatting is always on and cannot be disabled per-agent. It works alongside punctuation (punctuate: true), which adds commas, periods, and question marks to the transcript.

Smart formatting makes it much easier for the LLM to extract structured data (phone numbers, dates, dollar amounts) from caller speech. This is especially valuable for intake forms, appointment scheduling, and order-taking agents.

10. Configuration Examples

Fast-Paced Sales Call

Optimized for quick back-and-forth, rapid response, easy barge-in:

{
  "stt_model": "nova-2",
  "stt_language": "en",
  "endpointing": 200,
  "bargein_threshold": 2,
  "enable_filler_suppression": false,
  "enable_noise_suppression": false,
  "noise_gate_threshold": 0,
  "stt_keywords": ["EWT", "demo", "pricing", "onboarding"]
}

Medical Intake (Accuracy-First)

Longer endpointing so patients are never cut off, keyword boosting for drug names, filler suppression for clean agent speech:

{
  "stt_model": "nova-2-medical",
  "stt_language": "en",
  "endpointing": 500,
  "bargein_threshold": 10,
  "enable_filler_suppression": true,
  "enable_noise_suppression": false,
  "noise_gate_threshold": 0,
  "stt_keywords": [
    "metformin",
    "lisinopril",
    "acetaminophen",
    "ibuprofen",
    "amoxicillin",
    "HIPAA"
  ]
}

Noisy Environment (Outdoor / Driving Callers)

Aggressive noise gating, higher barge-in threshold to prevent false triggers from wind or road noise:

{
  "stt_model": "nova-2-phonecall",
  "stt_language": "en",
  "endpointing": 350,
  "bargein_threshold": 8,
  "enable_filler_suppression": true,
  "enable_noise_suppression": true,
  "noise_gate_threshold": 40,
  "stt_keywords": []
}

11. Troubleshooting

Agent cuts off the caller mid-sentence

Cause: Endpointing is set too low. The agent interprets brief pauses as the end of an utterance.

Fix: Increase endpointing to 400–500ms. If the issue persists, check that callers are not on poor connections where packet delays create artificial gaps.

{ "endpointing": 450 }

Agent is too slow to respond

Cause: Endpointing is set too high, causing the agent to wait too long after the caller finishes speaking.

Fix: Decrease endpointing to 200–300ms. Note that the utterance_end_ms hard limit of 1000ms means the maximum wait will never exceed one second regardless of the endpointing value.

{ "endpointing": 250 }

Background noise causes false barge-in triggers

Cause: Deepgram's VAD detects ambient noise as speech, triggering barge-in and cutting off the agent's response.

Fix: Enable noise suppression and set a threshold. Also increase bargein_threshold so brief noise spikes are not enough to trigger an interruption.

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 30,
  "bargein_threshold": 8
}

Specific keywords are not being recognized

Cause: Deepgram's language model does not have strong priors for uncommon terms like brand names, acronyms, or technical jargon.

Fix: Add the terms to stt_keywords. Use the exact casing and spelling you expect in the transcript. Keep the list under 30 items for best results.

{
  "stt_keywords": ["Zapier", "HubSpot", "HIPAA", "SOC2"]
}

Transcripts are garbled or low-confidence

Cause: Language mismatch. The agent's stt_language does not match the language the caller is actually speaking.

Fix: Verify that stt_language matches the expected caller language. For multilingual use cases, consider instructing the LLM to use the switch_language tool when it detects the caller is speaking a different language.

Agent does not respond at all after caller speaks

Cause: Noise gate threshold is set too high, silently dropping all audio before it reaches Deepgram.

Fix: Lower noise_gate_threshold or disable noise suppression entirely. Check call logs for _audioChunksReceived — if this is 0 or very low relative to call duration, the noise gate is too aggressive.

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 15
}