STT & Language Settings

Configure Speech-to-Text, language detection, noise handling, and transcription accuracy for your EWT Voice Agent.

1. Overview — How STT Works in the Pipeline

Every voice call processed by EWT Voice Agent follows a real-time pipeline. Understanding where STT fits helps you tune it effectively.

Caller → Twilio (mulaw 8kHz audio) → Deepgram WebSocket (STT) → Transcript → LLM → Response text → TTS → Audio back to caller

Audio arrives from Twilio as 20ms mulaw-encoded chunks at 8000 Hz, mono channel. These chunks are forwarded in real time to a persistent Deepgram WebSocket connection. Deepgram returns both interim (partial) and final transcription results, along with Voice Activity Detection (VAD) events. The final transcript is then sent to the LLM for response generation.

Key events emitted by the STT provider:

  - Interim (partial) transcripts: streamed while the caller is still speaking
  - Final transcripts: emitted once an utterance is judged complete
  - speech_started: a VAD event fired the instant voice activity is detected
  - utterance_end: fired after the utterance_end_ms silence window elapses

The connection includes a keep-alive ping every 10 seconds to prevent WebSocket timeouts during long pauses in conversation.
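The keep-alive can be sketched as a small background task. This is an illustrative sketch, not the platform's actual code: the `keepalive` helper and the websocket object's async `send` method are assumptions, though Deepgram's live API does accept a `{"type": "KeepAlive"}` text frame.

```python
import asyncio
import json

async def keepalive(ws, interval_s: float = 10.0) -> None:
    """Send a Deepgram KeepAlive frame on a fixed interval.

    Prevents the WebSocket from timing out during long silences
    in the conversation.
    """
    while True:
        await asyncio.sleep(interval_s)
        await ws.send(json.dumps({"type": "KeepAlive"}))
```

In practice this task is created when the Deepgram connection opens and cancelled when the call ends.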

2. Deepgram Models

The STT model is configured via the stt_model column on the agent. The default is nova-2.

| Model | Best For | Notes |
|---|---|---|
| nova-2 | General use (default) | Best overall accuracy and speed. Recommended for most agents. |
| nova-2-general | Broad vocabulary | Slightly wider vocabulary coverage. Use when callers use diverse or uncommon terminology. |
| nova-2-phonecall | Telephony audio | Optimized for 8kHz phone-quality audio. Can improve accuracy on noisy phone lines. |
| nova-2-medical | Healthcare | Enhanced recognition of medical terminology, drug names, and clinical language. |

To change the model via the API:

PUT /api/agents/:id
Content-Type: application/json

{
  "stt_model": "nova-2-phonecall"
}
nova-2 already performs well on 8kHz mulaw telephony audio, so phone-based agents rarely need a different model. Only switch to nova-2-phonecall if you notice accuracy issues on poor-quality lines.

3. Language Support

Set the transcription language with the stt_language agent column (default: en). The platform also supports mid-call language switching via the built-in switch_language tool, which disconnects and reconnects the Deepgram WebSocket with the new language code.

| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Japanese | ja |
| Spanish | es | Chinese (Mandarin) | zh |
| French | fr | Korean | ko |
| German | de | Dutch | nl |
| Portuguese | pt | Polish | pl |
| Italian | it | Russian | ru |
| Hindi | hi | Swedish | sv |
| Arabic | ar | Czech | cs |
| Turkish | tr | Romanian | ro |

Set the language when creating or updating an agent:

PUT /api/agents/:id
Content-Type: application/json

{
  "stt_language": "es",
  "language": "es"
}
Setting stt_language changes only the transcription language. You should also set the agent's language field so the LLM and TTS respond in the same language. Mid-call switching via the switch_language tool updates both automatically.

4. Endpointing

Endpointing controls how long Deepgram waits after the caller stops speaking before it considers the utterance complete. It is specified in milliseconds and stored in the endpointing agent column.

The platform passes this value directly to Deepgram's endpointing parameter. A separate utterance_end_ms of 1000ms acts as a hard backstop — if no speech is detected for 1 second, Deepgram fires an utterance_end event regardless.
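Putting these settings together, the connection options sent to Deepgram might be assembled like this. The option names match Deepgram's live-streaming query parameters, but the `build_stt_options` helper and the agent dict shape are illustrative assumptions:

```python
def build_stt_options(agent: dict) -> dict:
    """Assemble Deepgram live-transcription options from agent settings."""
    return {
        "model": agent.get("stt_model") or "nova-2",
        "language": agent.get("stt_language") or "en",
        "encoding": "mulaw",      # Twilio sends mulaw-encoded audio
        "sample_rate": 8000,      # 8kHz mono telephony audio
        "channels": 1,
        "interim_results": True,  # stream partial transcripts
        "smart_format": True,     # format numbers, dates, emails
        "punctuate": True,
        "vad_events": True,       # emit speech_started for barge-in
        # 400ms fallback when no endpointing is configured on the agent
        "endpointing": agent.get("endpointing") or 400,
        "utterance_end_ms": 1000,  # hard backstop after 1s of silence
    }
```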

How It Affects Responsiveness

| Scenario | Recommended Endpointing | Why |
|---|---|---|
| Fast-paced sales / lead qualification | 150–250ms | Quick back-and-forth feels natural for sales conversations |
| Customer support | 300–400ms | Gives callers time to describe problems without being cut off |
| Medical intake / legal | 400–600ms | Callers may pause to recall details; cutting off is unacceptable |
| Elderly callers / accessibility | 500–700ms | Slower speech pace requires longer pauses |
PUT /api/agents/:id
Content-Type: application/json

{
  "endpointing": 400
}
The default endpointing in the database schema is 200ms, but CallSession applies a fallback of 400ms if no value is configured. Start at 400ms and decrease only if the agent feels sluggish.

5. Barge-in

Barge-in allows callers to interrupt the agent while it is speaking. When the system detects that the caller has started talking over the agent, it immediately stops TTS playback and audio streaming, then switches to listening mode.

How It Works

Barge-in is triggered in three ways, whichever fires first:

  1. VAD speech_started — Deepgram's Voice Activity Detection fires the instant it detects voice, 200–400ms faster than waiting for transcript text. This triggers an immediate barge-in while the agent is speaking.
  2. Interim transcript length — If an interim (non-final) transcript exceeds bargein_threshold characters, barge-in fires.
  3. Final transcript — Any final transcript received while the agent is speaking triggers barge-in.
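The three triggers can be condensed into one decision function. This is an illustrative sketch; the event names follow the descriptions above, not necessarily the platform's internal naming:

```python
def should_barge_in(
    agent_speaking: bool,
    event: str,                 # "speech_started" | "interim" | "final"
    transcript: str = "",
    bargein_threshold: int = 3,
) -> bool:
    """Decide whether detected caller speech should interrupt the agent."""
    if not agent_speaking:
        return False            # nothing to interrupt
    if event == "speech_started":
        return True             # VAD fires fastest, before any text exists
    if event == "interim":
        # interim transcripts must reach the configured character count
        return len(transcript) >= bargein_threshold
    if event == "final":
        return True             # any final transcript interrupts
    return False
```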

The bargein_threshold Setting

This is the minimum number of characters in an interim transcript needed to trigger barge-in. Default is 3. Stored as the bargein_threshold column on the agent.

| Value | Behavior | Best For |
|---|---|---|
| 1–3 | Very sensitive — any detected speech interrupts immediately | Conversational, casual agents |
| 5–10 | Moderate — requires a short phrase before interrupting | Professional support, balanced feel |
| 15+ | Less sensitive — agent finishes more of its speech before yielding | Formal/scripted calls, IVR-style flows |
PUT /api/agents/:id
Content-Type: application/json

{
  "bargein_threshold": 5
}
After barge-in, the system waits 800ms for the caller to finish their sentence before sending the accumulated transcript to the LLM. This prevents sending a half-finished thought for processing.
With VAD-based barge-in active, even background noise that Deepgram interprets as speech can trigger interruptions. If callers are in noisy environments, consider enabling noise suppression (Section 8) to reduce false barge-ins.

6. Keyword Spotting

Deepgram's keyword boosting improves recognition accuracy for specific terms that the model might otherwise miss or transcribe incorrectly. This is especially useful for proper nouns, brand names, product names, and industry-specific jargon.

Keywords are stored as a stt_keywords JSONB array on the agent and passed directly to Deepgram's keywords parameter when the WebSocket connection is established.

How It Works

When you provide keywords, Deepgram boosts the probability of recognizing those terms in the audio stream. The model does not exclusively listen for these words — it biases its existing language model toward them.

Use Cases

Typical candidates are brand and product names, acronyms, and industry-specific jargon that callers mention often. For example:
PUT /api/agents/:id
Content-Type: application/json

{
  "stt_keywords": [
    "EWT",
    "Salesforce",
    "HubSpot",
    "onboarding",
    "HIPAA"
  ]
}
Adding too many keywords (50+) can degrade overall transcription accuracy. Keep the list focused on terms that are genuinely being misrecognized. Test with real calls after adding keywords.

7. Filler Suppression

Filler suppression removes common filler words and hesitation markers from the agent's TTS output, producing cleaner and more professional-sounding speech. It is controlled by the enable_filler_suppression boolean column on the agent (default: false).

What Gets Removed

When enabled, common filler words and hesitation markers, such as "um", "uh", "er", and "hmm", are stripped from LLM-generated text before it is sent to TTS.
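A minimal sketch of the stripping step. The filler list and regex here are illustrative, not the platform's exact implementation:

```python
import re

# Illustrative filler list; the platform's actual list may differ.
FILLERS = ["um", "uh", "er", "hmm", "you know"]

# Match a filler word plus an adjacent comma, case-insensitively.
_FILLER_RE = re.compile(
    r"(?:,\s*)?\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?",
    re.IGNORECASE,
)

def suppress_fillers(text: str) -> str:
    """Strip filler words from LLM output before it is sent to TTS."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```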

When to Enable vs. Disable

| Enable | Disable (Default) |
|---|---|
| Professional/corporate agents | Conversational, casual agents where fillers sound more human |
| Medical, legal, or financial use cases | Agents using "deny AI" mode where human-like speech is critical |
| IVR-style scripted flows | Any agent where natural-sounding pauses improve trust |
PUT /api/agents/:id
Content-Type: application/json

{
  "enable_filler_suppression": true
}
Filler suppression operates on the TTS output (the agent's speech), not on the caller's transcribed input. The caller's "um"s and "uh"s still appear in the transcript sent to the LLM, which is usually desirable since it preserves the caller's intent.

8. Noise Suppression

EWT Voice Agent includes a client-side noise gate that suppresses low-amplitude audio chunks before they reach Deepgram. This prevents background noise from generating false transcripts or triggering barge-in.

Two agent columns control this behavior:

  - enable_noise_suppression: boolean toggle for the client-side noise gate
  - noise_gate_threshold: gate sensitivity, from 0 (off) to 100 (most aggressive)

How the Noise Gate Works

Twilio sends mulaw-encoded audio where byte value 255 represents silence. The system calculates the average distance of each byte from 255 across the chunk. The threshold value (0–100) maps to an amplitude range of 0–50. If the average amplitude is below threshold / 2, the chunk is discarded.
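The gate described above amounts to a few lines. This sketch mirrors the calculation in the paragraph; the `passes_noise_gate` name is illustrative:

```python
def passes_noise_gate(chunk: bytes, threshold: int) -> bool:
    """Return True if a mulaw audio chunk is loud enough to forward.

    Byte value 255 represents silence in Twilio's mulaw encoding, so the
    average distance from 255 approximates amplitude. A threshold of
    0-100 maps to an amplitude cutoff of 0-50 (threshold / 2).
    """
    if threshold <= 0 or not chunk:
        return True                      # gating disabled
    avg = sum(abs(b - 255) for b in chunk) / len(chunk)
    return avg >= threshold / 2          # below the cutoff: discard chunk
```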

| Threshold | Effect | Best For |
|---|---|---|
| 0 | No gating (all audio passes through) | Quiet environments, high-quality connections |
| 10–20 | Filters subtle background hum, AC noise | Office environments, standard phone calls |
| 30–50 | Filters moderate ambient noise | Call centers, busy offices |
| 60–80 | Aggressive filtering, only loud speech passes | Outdoor callers, construction sites, driving |
| 90–100 | Extremely aggressive — may drop quiet speech | Not recommended for most use cases |
PUT /api/agents/:id
Content-Type: application/json

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 25
}
Setting the threshold too high (above 60) can cause the system to drop legitimate quiet speech, especially from soft-spoken callers or those on speakerphone. Always test with real calls before deploying aggressive thresholds to production.

9. Smart Formatting

Smart formatting is enabled by default (smart_format: true in the Deepgram connection options). It automatically formats common patterns in transcribed text so the LLM receives clean, structured input.

What Gets Formatted

| Spoken Input | Raw Transcript | Smart Formatted |
|---|---|---|
| "five five five one two three four" | five five five one two three four | 555-1234 |
| "twenty five dollars" | twenty five dollars | $25 |
| "january fifteenth twenty twenty six" | january fifteenth twenty twenty six | January 15th, 2026 |
| "three point one four" | three point one four | 3.14 |
| "my email is john at example dot com" | my email is john at example dot com | my email is john@example.com |

Smart formatting is always on and cannot be disabled per-agent. It works alongside punctuation (punctuate: true), which adds commas, periods, and question marks to the transcript.

Smart formatting makes it much easier for the LLM to extract structured data (phone numbers, dates, dollar amounts) from caller speech. This is especially valuable for intake forms, appointment scheduling, and order-taking agents.

10. Configuration Examples

Fast-Paced Sales Call

Optimized for quick back-and-forth, rapid response, easy barge-in:

{
  "stt_model": "nova-2",
  "stt_language": "en",
  "endpointing": 200,
  "bargein_threshold": 2,
  "enable_filler_suppression": false,
  "enable_noise_suppression": false,
  "noise_gate_threshold": 0,
  "stt_keywords": ["EWT", "demo", "pricing", "onboarding"]
}

Medical Intake (Accuracy-First)

Longer endpointing so patients are never cut off, keyword boosting for drug names, filler suppression for clean agent speech:

{
  "stt_model": "nova-2-medical",
  "stt_language": "en",
  "endpointing": 500,
  "bargein_threshold": 10,
  "enable_filler_suppression": true,
  "enable_noise_suppression": false,
  "noise_gate_threshold": 0,
  "stt_keywords": [
    "metformin",
    "lisinopril",
    "acetaminophen",
    "ibuprofen",
    "amoxicillin",
    "HIPAA"
  ]
}

Noisy Environment (Outdoor / Driving Callers)

Aggressive noise gating, higher barge-in threshold to prevent false triggers from wind or road noise:

{
  "stt_model": "nova-2-phonecall",
  "stt_language": "en",
  "endpointing": 350,
  "bargein_threshold": 8,
  "enable_filler_suppression": true,
  "enable_noise_suppression": true,
  "noise_gate_threshold": 40,
  "stt_keywords": []
}

11. Troubleshooting

Agent cuts off the caller mid-sentence

Cause: Endpointing is set too low. The agent interprets brief pauses as the end of an utterance.

Fix: Increase endpointing to 400–500ms. If the issue persists, check that callers are not on poor connections where packet delays create artificial gaps.

{ "endpointing": 450 }

Agent is too slow to respond

Cause: Endpointing is set too high, causing the agent to wait too long after the caller finishes speaking.

Fix: Decrease endpointing to 200–300ms. Note that the utterance_end_ms hard limit of 1000ms means the maximum wait will never exceed one second regardless of the endpointing value.

{ "endpointing": 250 }

Background noise causes false barge-in triggers

Cause: Deepgram's VAD detects ambient noise as speech, triggering barge-in and cutting off the agent's response.

Fix: Enable noise suppression and set a threshold. Also increase bargein_threshold so brief noise spikes are not enough to trigger an interruption.

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 30,
  "bargein_threshold": 8
}

Specific keywords are not being recognized

Cause: Deepgram's language model does not have strong priors for uncommon terms like brand names, acronyms, or technical jargon.

Fix: Add the terms to stt_keywords. Use the exact casing and spelling you expect in the transcript. Keep the list under 30 items for best results.

{
  "stt_keywords": ["Zapier", "HubSpot", "HIPAA", "SOC2"]
}

Transcripts are garbled or low-confidence

Cause: Language mismatch. The agent's stt_language does not match the language the caller is actually speaking.

Fix: Verify that stt_language matches the expected caller language. For multilingual use cases, consider instructing the LLM to use the switch_language tool when it detects the caller is speaking a different language.

Agent does not respond at all after caller speaks

Cause: Noise gate threshold is set too high, silently dropping all audio before it reaches Deepgram.

Fix: Lower noise_gate_threshold or disable noise suppression entirely. Check call logs for _audioChunksReceived — if this is 0 or very low relative to call duration, the noise gate is too aggressive.

{
  "enable_noise_suppression": true,
  "noise_gate_threshold": 15
}