Choosing the Right LLM Model
How to pick the best language model for your voice agent -- balancing speed, intelligence, cost, and tool support.
1. Overview: The LLM in a Voice Call
The Large Language Model (LLM) is the brain of every voice agent. During a live phone call, the pipeline works like this:
- STT (Deepgram Nova-2) transcribes the caller's speech into text.
- LLM interprets the transcript, decides how to respond, and optionally invokes tools (book an appointment, look up a record, transfer the call).
- TTS (ElevenLabs) converts the LLM's text response back into speech.
The LLM is typically both the most expensive and the most latency-sensitive component. Every millisecond the model takes to begin responding is time the caller spends waiting in silence. For natural-feeling conversations, you want the LLM's first token to arrive in under 500ms. That constraint is why model selection matters so much for voice.
The default model is anthropic/claude-haiku-4.5 -- it offers the best balance of speed, capability, and cost for most voice use cases.
2. Available Providers
OpenRouter (Primary)
All LLM calls go through OpenRouter, a unified API gateway that provides access to 200+ models from every major provider. The platform uses OpenRouter's OpenAI-compatible streaming API (/v1/chat/completions), so any model on OpenRouter works automatically.
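Responses come back as OpenAI-style server-sent events, where each `data:` line carries a JSON chunk with a text delta. The sketch below (an illustrative helper, not platform code) shows how those chunks are parsed into text; the sample lines mimic the shape OpenRouter streams back.

```python
import json

def extract_deltas(sse_lines):
    """Parse OpenAI-style SSE chunks and yield each text delta."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments/keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Sample chunks in the shape the streaming API returns
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", caller!"}}]}',
    'data: [DONE]',
]
print("".join(extract_deltas(sample)))  # Hello, caller!
```

Because deltas arrive token by token, the first sentence can be handed to TTS while the model is still generating the rest.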
Supported model families include:
- Anthropic -- Claude Sonnet 4, Claude Haiku 4.5, Claude 3.5 Sonnet
- OpenAI -- GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano
- Google -- Gemini 2.5 Flash, Gemini 2.0 Flash, Gemini 2.5 Pro
- Meta -- Llama 4 Maverick, Llama 3.3 70B
- DeepSeek -- DeepSeek V3, DeepSeek R1
- xAI -- Grok 3 Mini, Grok 3 Beta
- Mistral -- Mistral Large, Mistral Small 3.1
- Fast inference -- Groq Llama 3.3 70B, Cerebras Llama 3.3 70B
Custom Endpoints
You can point any agent at a self-hosted or third-party model by setting llm_base_url and llm_api_key in the agent configuration. The endpoint must expose an OpenAI-compatible /chat/completions route with streaming support. This works with Ollama, vLLM, LocalAI, Together AI, and any other OpenAI-compatible server. See Section 6 for details.
3. Model Discovery (300+ Models)
The dashboard model dropdown shows every model available through OpenRouter -- currently 300+. To help you navigate this, models are split into two tiers:
Proven Tier
Models that have been tested specifically for real-time voice conversations. Each model in this tier has a composite score derived from three factors:
- Quality -- instruction-following, coherence, and accuracy
- Speed -- time-to-first-token, weighted heavily for voice latency requirements
- Cost -- per-minute LLM cost assuming ~600 input / 400 output tokens per minute
Models are ranked highest-to-lowest by composite score so the best all-around options appear at the top. You also see the estimated per-minute voice cost displayed next to each model name so you can make cost-aware decisions at a glance.
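A composite score of this kind is just a weighted blend of the three normalized factors. The sketch below uses hypothetical weights and scores (the platform's actual numbers are not published here) to show how the ranking falls out:

```python
def composite_score(quality, speed, cost, weights=(0.4, 0.4, 0.2)):
    """Blend normalized 0-1 factor scores (1 = best) into one ranking value.

    Weights are illustrative: speed is weighted as heavily as quality
    because voice latency matters so much.
    """
    wq, ws, wc = weights
    return wq * quality + ws * speed + wc * cost

# Hypothetical normalized scores -- not the platform's real numbers
models = {
    "claude-haiku-4.5": composite_score(quality=0.90, speed=0.85, cost=0.85),
    "gpt-4o-mini":      composite_score(quality=0.70, speed=0.95, cost=1.00),
}
ranked = sorted(models.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # claude-haiku-4.5
```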
Untested Tier
Every other model on OpenRouter. These appear below the Proven models in the dropdown, clearly labeled "Untested -- use at your own risk." They may work fine, but they haven't been validated for voice latency or tool calling.
4. Recommended Models for Voice
The dashboard curates a set of models specifically suited for real-time voice conversations. The table below lists the top picks with estimated costs and latency. Cost per minute assumes roughly 600 input tokens and 400 output tokens per minute (about 4 conversational turns).
| Model | Provider | $/min (LLM only) | Latency (est.) | Best For |
|---|---|---|---|---|
| Claude Haiku 4.5 DEFAULT | Anthropic | $0.0021 | ~400ms | Best all-around: fast, cheap, smart enough for most agents |
| GPT-4o Mini FAST | OpenAI | $0.0003 | ~300ms | Ultra-cheap, fast. Great for simple FAQ and info collection |
| GPT-4.1 Mini | OpenAI | $0.0009 | ~300ms | Newer GPT-4o Mini replacement, slightly smarter |
| Gemini 2.0 Flash CHEAPEST | Google | $0.0002 | ~250ms | Lowest cost, ultra-fast. Good for high-volume simple tasks |
| Gemini 2.5 Flash | Google | $0.0003 | ~250ms | Slightly smarter than 2.0 Flash at similar cost |
| GPT-4o | OpenAI | $0.0055 | ~600ms | Strong general-purpose. Customer support, sales, complex flows |
| Claude Sonnet 4 SMART | Anthropic | $0.0078 | ~800ms | High intelligence. Complex reasoning, nuanced conversations |
| GPT-4.1 | OpenAI | $0.0044 | ~600ms | Latest OpenAI flagship. Strong instruction following |
| DeepSeek V3 | DeepSeek | $0.0005 | ~500ms | Very cheap, decent quality. Budget-friendly option |
| Llama 3.3 70B | Meta | $0.0002 | ~300ms | Open-source, very cheap. Moderate tool calling support |
| Mistral Small 3.1 | Mistral | $0.0004 | ~300ms | Small, fast, cheap. Good for European language support |
| Grok 3 Mini | xAI | $0.0004 | ~400ms | Compact reasoning model from xAI |
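The per-minute figures above follow directly from the 600-input / 400-output token assumption. As a worked check (using OpenAI's published GPT-4o Mini list prices at the time of writing, $0.15/M input and $0.60/M output):

```python
def llm_cost_per_min(input_price_per_m, output_price_per_m,
                     input_tokens=600, output_tokens=400):
    """Per-minute LLM cost, given prices in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# GPT-4o Mini: $0.15/M input, $0.60/M output
print(round(llm_cost_per_min(0.15, 0.60), 5))  # 0.00033
```

That lands at roughly $0.0003/min, matching the table. Swap in any model's token prices to estimate its row.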
5. Temperature
Temperature controls how deterministic or creative the model's responses are. It is a value between 0 and 2 (though anything above 1.0 is rarely useful for voice).
- 0.0 -- Fully deterministic. The model always picks the most likely next word. Responses are consistent but can feel robotic.
- 0.3 - 0.5 -- Low randomness. Recommended for structured tasks: appointment scheduling, data collection, surveys, compliance-sensitive calls (e.g., debt collection).
- 0.6 - 0.8 -- Moderate randomness. Recommended for conversational agents: customer support, lead qualification, receptionists. This range feels natural without being unpredictable.
- 0.9 - 1.0 -- High randomness. Occasionally useful for creative or brainstorming agents, but responses may drift off-topic.
The platform default is 0.7, which works well for most conversational agents. You can set temperature per agent in the agent editor.
// Example: Setting temperature in agent config
{
"model": "anthropic/claude-haiku-4.5",
"temperature": 0.5, // Lower for structured tasks
"max_tokens": 300
}
6. Max Tokens
max_tokens controls the maximum length of the model's response per turn. It does not affect the input (what the model reads), only the output (what it generates). The platform default is 1024 tokens (~750 words).
Guidelines
- 150 - 250 tokens -- Short, snappy responses. Good for quick-answer agents, IVR-style menus, and simple confirmations. Reduces cost and latency.
- 300 - 500 tokens -- Standard conversational range. Works for most agents.
- 500 - 1024 tokens -- Longer explanations. Use for agents that need to read back policies, give detailed instructions, or summarize records.
Tip: For voice, responses should be short anyway. Set a low max_tokens and instruct the model in the system prompt to keep responses to 1-2 sentences.
Impact of max_tokens:
- Cost -- Output tokens are 2-5x more expensive than input tokens for most models. Reducing max_tokens directly reduces your worst-case cost per turn.
- Latency -- More tokens = more time generating. However, since TTS starts as soon as the first sentence arrives (streaming), the impact on perceived latency is modest.
7. Custom LLM Endpoints
You can use any OpenAI-compatible LLM server by setting two fields in the agent configuration:
- llm_base_url -- The base URL of your API server (e.g., http://localhost:11434/v1)
- llm_api_key -- An API key if your server requires one. Leave blank for local servers.
When llm_base_url is set, the platform sends requests to {llm_base_url}/chat/completions instead of OpenRouter. The request format is identical -- OpenAI-compatible JSON with streaming (stream: true).
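The routing rule is simple enough to sketch. This hypothetical helper (the function name is illustrative, not the platform's actual code) shows the URL the request goes to in each case, using OpenRouter's real base URL as the fallback:

```python
OPENROUTER_BASE = "https://openrouter.ai/api/v1"

def resolve_endpoint(llm_base_url=None):
    """Pick the chat completions URL: custom endpoint if set, else OpenRouter."""
    base = (llm_base_url or OPENROUTER_BASE).rstrip("/")
    return f"{base}/chat/completions"

print(resolve_endpoint("http://localhost:11434/v1"))
# http://localhost:11434/v1/chat/completions
print(resolve_endpoint())
# https://openrouter.ai/api/v1/chat/completions
```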
Ollama Example
Ollama runs models locally on your own hardware. To use it with a voice agent:
# 1. Install and start Ollama
brew install ollama
ollama serve
# 2. Pull a fast model
ollama pull llama3.1:8b
# 3. Verify it's running (Ollama exposes an OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
Then configure your agent:
// Agent configuration for Ollama
{
"model": "llama3.1:8b",
"llm_base_url": "http://localhost:11434/v1",
"llm_api_key": "",
"temperature": 0.7,
"max_tokens": 300
}
vLLM Example
# Start vLLM with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000
# Agent config
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"llm_base_url": "http://your-gpu-server:8000/v1",
"llm_api_key": "not-needed"
}
Note: Your endpoint must support streaming (stream: true) with the OpenAI chat completions format. Non-streaming endpoints will not work. Also make sure your server is reachable from wherever the voice agent platform is running (Railway in production).
8. Cost Optimization
The dashboard includes a live cost estimator that breaks down the per-minute cost of a voice call into four components:
| Component | Service | Cost/min | Notes |
|---|---|---|---|
| STT | Deepgram Nova-2 | $0.0059 | Fixed rate, always on during the call |
| LLM | Varies by model | $0.0002 - $0.016 | Biggest variable. Depends on model choice |
| TTS | ElevenLabs | ~$0.025 | ~300 characters/min at $0.000083/char |
| Transport | Twilio Voice + Media Streams | $0.018 | $0.014 voice + $0.004 media stream |
For a typical call using Claude Haiku 4.5, the total comes to roughly $0.051/min (~$3.06/hr). With GPT-4o Mini or Gemini Flash, the LLM cost is so low that TTS and transport dominate.
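The headline number is just the sum of the four components in the table. As a quick arithmetic check using the Claude Haiku 4.5 figures:

```python
# Approximate per-minute components from the cost table above
components = {
    "stt_deepgram_nova2":  0.0059,
    "llm_claude_haiku45":  0.0021,
    "tts_elevenlabs":      0.025,
    "transport_twilio":    0.018,
}
per_min = sum(components.values())
print(f"${per_min:.3f}/min, ~${per_min * 60:.2f}/hr")  # $0.051/min, ~$3.06/hr
```

Note that the LLM is only ~4% of the total here; once you drop below Haiku-level pricing, further LLM savings barely move the per-minute cost.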
Strategies to Reduce Cost
- Choose a cheaper model. Switching from Claude Sonnet 4 ($0.0078/min) to Claude Haiku 4.5 ($0.0021/min) saves ~73% on the LLM component. For simple tasks, GPT-4o Mini or Gemini Flash cost almost nothing.
- Lower max_tokens. Output tokens are the most expensive part of LLM usage. If your agent only needs 1-2 sentence replies, set max_tokens to 150-250.
- Keep calls short. Design your agent's system prompt to be efficient. Avoid open-ended chitchat. Use clear calls-to-action to move the conversation forward.
- Use the cost estimator. In the agent editor, the cost and latency bars update live as you change models. Use this to compare options before deploying.
9. Tool Calling Compatibility
Tool calling lets the LLM invoke functions during a conversation -- booking appointments, transferring calls, looking up records, tagging calls, and more. Not all models handle tool calling equally well.
Tool Calling Support by Model
| Model | Tool Calling | Notes |
|---|---|---|
| Claude Haiku 4.5 / Sonnet 4 | Excellent | Best-in-class tool use. Reliable argument parsing, rarely hallucinates tool names. |
| GPT-4o / GPT-4.1 | Excellent | Native function calling. Very reliable with complex tool schemas. |
| GPT-4o Mini / GPT-4.1 Mini | Good | Works well for simple tools. May struggle with many tools or complex schemas. |
| Gemini 2.5 Flash / Pro | Good | Solid function calling support via OpenRouter. |
| Mistral Large | Good | Native function calling support. |
| Llama 3.3 70B / Llama 4 | Moderate | Works for basic tools. Can misformat arguments with complex schemas. |
| DeepSeek V3 | Moderate | Basic tool calling works. Not recommended for agents with many tools. |
| Mistral Small / Ministral 8B | Limited | May fail on multi-tool scenarios. Stick to no-tool or single-tool agents. |
| Self-hosted (Ollama, vLLM) | Varies | Depends on the model. Llama 3.1+ and Mistral have some support. Test thoroughly. |
The platform converts Anthropic-style tool definitions to OpenAI function-calling format automatically. When a model returns tool call results, they are streamed incrementally and parsed as they arrive.
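The shape of that conversion is straightforward: Anthropic tools carry `name`, `description`, and an `input_schema` (JSON Schema), while OpenAI function calling wraps the same fields as `function.parameters`. A minimal sketch of this kind of mapping (illustrative, not the platform's actual implementation):

```python
def anthropic_tool_to_openai(tool):
    """Map an Anthropic-style tool definition onto OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],  # both sides use JSON Schema
        },
    }

# Hypothetical booking tool in Anthropic format
book = {
    "name": "book_appointment",
    "description": "Book an appointment for the caller.",
    "input_schema": {
        "type": "object",
        "properties": {"date": {"type": "string"}},
        "required": ["date"],
    },
}
openai_tool = anthropic_tool_to_openai(book)
```

The JSON Schema body passes through unchanged; only the envelope differs between the two formats.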
10. Decision Flowchart
Use this quick-reference to pick a model based on your primary need:
Common Scenarios
- Appointment scheduler -- Claude Haiku 4.5 (temp 0.6, max_tokens 300). Needs reliable tool calling for booking.
- Customer support -- GPT-4o (temp 0.7, max_tokens 500). Good balance of intelligence and speed.
- Lead qualification -- GPT-4o or Claude Haiku 4.5 (temp 0.7, max_tokens 300). Needs tool calling for CRM integration.
- Simple FAQ / info line -- GPT-4o Mini or Gemini Flash (temp 0.5, max_tokens 200). Speed and cost matter most.
- Outbound sales -- GPT-4o or Claude Sonnet 4 (temp 0.7, max_tokens 400). Benefits from nuanced conversation ability.
- Survey / data collection -- GPT-4o Mini (temp 0.5, max_tokens 200). Simple task, optimize for cost.
- Compliance-sensitive (debt collection) -- GPT-4o or Claude Sonnet 4 (temp 0.5, max_tokens 400). Accuracy and instruction-following are critical.