Choosing the Right LLM Model
How to pick the best language model for your voice agent -- balancing speed, intelligence, cost, and tool support.
1. Overview: The LLM in a Voice Call
The Large Language Model (LLM) is the brain of every voice agent. During a live phone call, the pipeline works like this:
- STT (Deepgram Nova-2) transcribes the caller's speech into text.
- LLM interprets the transcript, decides how to respond, and optionally invokes tools (book an appointment, look up a record, transfer the call).
- TTS (ElevenLabs) converts the LLM's text response back into speech.
The LLM is typically both the most expensive and the most latency-sensitive component. Every millisecond the model takes to begin responding is time the caller spends waiting in silence. For natural-feeling conversations, you want the LLM's first token to arrive in under 500ms. That constraint is why model selection matters so much for voice.
The default model is anthropic/claude-haiku-4.5 -- it offers the best balance of speed, capability, and cost for most voice use cases.
2. Available Providers
OpenRouter (Primary)
All LLM calls go through OpenRouter, a unified API gateway that provides access to 200+ models from every major provider. The platform uses OpenRouter's OpenAI-compatible streaming API (/v1/chat/completions), so any model on OpenRouter works automatically.
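Responses come back as OpenAI-style server-sent events, where each `data:` line carries a JSON chunk with a text delta. The sketch below (an illustrative helper, not platform code) shows how those chunks are parsed into text; the sample lines mimic the shape OpenRouter streams back.

```python
import json

def extract_deltas(sse_lines):
    """Parse OpenAI-style SSE chunks and yield each text delta."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments/keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Sample chunks in the shape the streaming API returns
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", caller!"}}]}',
    'data: [DONE]',
]
print("".join(extract_deltas(sample)))  # Hello, caller!
```

Because deltas arrive token by token, the first sentence can be handed to TTS while the model is still generating the rest.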
Supported model families include:
- Anthropic -- Claude Sonnet 4, Claude Haiku 4.5, Claude 3.5 Sonnet
- OpenAI -- GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano
- Google -- Gemini 2.5 Flash, Gemini 2.0 Flash, Gemini 2.5 Pro
- Meta -- Llama 4 Maverick, Llama 3.3 70B
- DeepSeek -- DeepSeek V3, DeepSeek R1
- xAI -- Grok 3 Mini, Grok 3 Beta
- Mistral -- Mistral Large, Mistral Small 3.1
- Fast inference -- Groq Llama 3.3 70B, Cerebras Llama 3.3 70B
Custom Endpoints
You can point any agent at a self-hosted or third-party model by setting llm_base_url and llm_api_key in the agent configuration. The endpoint must expose an OpenAI-compatible /chat/completions route with streaming support. This works with Ollama, vLLM, LocalAI, Together AI, and any other OpenAI-compatible server. See Section 6 for details.
3. Model Discovery (300+ Models)
The dashboard model dropdown shows every model available through OpenRouter -- currently 300+. To help you navigate this, models are split into two tiers:
Proven Tier
Models that have been tested specifically for real-time voice conversations. Each model in this tier has a composite score derived from three factors:
- Quality -- instruction-following, coherence, and accuracy
- Speed -- time-to-first-token, weighted heavily for voice latency requirements
- Cost -- per-minute LLM cost assuming ~600 input / 400 output tokens per minute
Models are ranked highest-to-lowest by composite score so the best all-around options appear at the top. You also see the estimated per-minute voice cost displayed next to each model name so you can make cost-aware decisions at a glance.
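A composite score of this kind is just a weighted blend of the three normalized factors. The sketch below uses hypothetical weights and scores (the platform's actual numbers are not published here) to show how the ranking falls out:

```python
def composite_score(quality, speed, cost, weights=(0.4, 0.4, 0.2)):
    """Blend normalized 0-1 factor scores (1 = best) into one ranking value.

    Weights are illustrative: speed is weighted as heavily as quality
    because voice latency matters so much.
    """
    wq, ws, wc = weights
    return wq * quality + ws * speed + wc * cost

# Hypothetical normalized scores -- not the platform's real numbers
models = {
    "claude-haiku-4.5": composite_score(quality=0.90, speed=0.85, cost=0.85),
    "gpt-4o-mini":      composite_score(quality=0.70, speed=0.95, cost=1.00),
}
ranked = sorted(models.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # claude-haiku-4.5
```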
Untested Tier
Every other model on OpenRouter. These appear below the Proven models in the dropdown, clearly labeled "Untested -- use at your own risk." They may work fine, but they haven't been validated for voice latency or tool calling.
4. Recommended Models for Voice
The dashboard curates a set of models specifically suited for real-time voice conversations. The table below lists the top picks with estimated costs and latency. Cost per minute assumes roughly 600 input tokens and 400 output tokens per minute (about 4 conversational turns).
| Model | Provider | $/min (LLM only) | Latency (est.) | Best For |
|---|---|---|---|---|
| Claude Haiku 4.5 DEFAULT | Anthropic | $0.0021 | ~400ms | Best all-around: fast, cheap, smart enough for most agents |
| GPT-4o Mini FAST | OpenAI | $0.0003 | ~300ms | Ultra-cheap, fast. Great for simple FAQ and info collection |
| GPT-4.1 Mini | OpenAI | $0.0009 | ~300ms | Newer GPT-4o Mini replacement, slightly smarter |
| Gemini 2.0 Flash CHEAPEST | Google | $0.0002 | ~250ms | Lowest cost, ultra-fast. Good for high-volume simple tasks |
| Gemini 2.5 Flash | Google | $0.0003 | ~250ms | Slightly smarter than 2.0 Flash at similar cost |
| GPT-4o | OpenAI | $0.0055 | ~600ms | Strong general-purpose. Customer support, sales, complex flows |
| Claude Sonnet 4 SMART | Anthropic | $0.0078 | ~800ms | High intelligence. Complex reasoning, nuanced conversations |
| GPT-4.1 | OpenAI | $0.0044 | ~600ms | Latest OpenAI flagship. Strong instruction following |
| DeepSeek V3 | DeepSeek | $0.0005 | ~500ms | Very cheap, decent quality. Budget-friendly option |
| Llama 3.3 70B | Meta | $0.0002 | ~300ms | Open-source, very cheap. Moderate tool calling support |
| Mistral Small 3.1 | Mistral | $0.0004 | ~300ms | Small, fast, cheap. Good for European language support |
| Grok 3 Mini | xAI | $0.0004 | ~400ms | Compact reasoning model from xAI |
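The per-minute figures above follow directly from the 600-input / 400-output token assumption. As a worked check (using OpenAI's published GPT-4o Mini list prices at the time of writing, $0.15/M input and $0.60/M output):

```python
def llm_cost_per_min(input_price_per_m, output_price_per_m,
                     input_tokens=600, output_tokens=400):
    """Per-minute LLM cost, given prices in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# GPT-4o Mini: $0.15/M input, $0.60/M output
print(round(llm_cost_per_min(0.15, 0.60), 5))  # 0.00033
```

That lands at roughly $0.0003/min, matching the table. Swap in any model's token prices to estimate its row.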
5. Temperature
Temperature controls how deterministic or creative the model's responses are. It is a value between 0 and 2 (though anything above 1.0 is rarely useful for voice).
- 0.0 -- Fully deterministic. The model always picks the most likely next word. Responses are consistent but can feel robotic.
- 0.3 - 0.5 -- Low randomness. Recommended for structured tasks: appointment scheduling, data collection, surveys, compliance-sensitive calls (e.g., debt collection).
- 0.6 - 0.8 -- Moderate randomness. Recommended for conversational agents: customer support, lead qualification, receptionists. This range feels natural without being unpredictable.
- 0.9 - 1.0 -- High randomness. Occasionally useful for creative or brainstorming agents, but responses may drift off-topic.
The platform default is 0.7, which works well for most conversational agents. You can set temperature per agent in the agent editor.
// Example: Setting temperature in agent config
{
"model": "anthropic/claude-haiku-4.5",
"temperature": 0.5, // Lower for structured tasks
"max_tokens": 300
}
6. Max Tokens
max_tokens controls the maximum length of the model's response per turn. It does not affect the input (what the model reads), only the output (what it generates). The platform default is 1024 tokens (~750 words).
Guidelines
- 150 - 250 tokens -- Short, snappy responses. Good for quick-answer agents, IVR-style menus, and simple confirmations. Reduces cost and latency.
- 300 - 500 tokens -- Standard conversational range. Works for most agents.
- 500 - 1024 tokens -- Longer explanations. Use for agents that need to read back policies, give detailed instructions, or summarize records.
Tip: For voice, responses should be short anyway. Set a low max_tokens and instruct the model in the system prompt to keep responses to 1-2 sentences.
Impact of max_tokens:
- Cost -- Output tokens are 2-5x more expensive than input tokens for most models. Reducing max_tokens directly reduces your worst-case cost per turn.
- Latency -- More tokens = more time generating. However, since TTS starts as soon as the first sentence arrives (streaming), the impact on perceived latency is modest.
7. Custom LLM Endpoints
You can use any OpenAI-compatible LLM server by setting two fields in the agent configuration:
- llm_base_url -- The base URL of your API server (e.g., http://localhost:11434/v1)
- llm_api_key -- An API key if your server requires one. Leave blank for local servers.
When llm_base_url is set, the platform sends requests to {llm_base_url}/chat/completions instead of OpenRouter. The request format is identical -- OpenAI-compatible JSON with streaming (stream: true).
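The routing rule is simple enough to sketch. This hypothetical helper (the function name is illustrative, not the platform's actual code) shows the URL the request goes to in each case, using OpenRouter's real base URL as the fallback:

```python
OPENROUTER_BASE = "https://openrouter.ai/api/v1"

def resolve_endpoint(llm_base_url=None):
    """Pick the chat completions URL: custom endpoint if set, else OpenRouter."""
    base = (llm_base_url or OPENROUTER_BASE).rstrip("/")
    return f"{base}/chat/completions"

print(resolve_endpoint("http://localhost:11434/v1"))
# http://localhost:11434/v1/chat/completions
print(resolve_endpoint())
# https://openrouter.ai/api/v1/chat/completions
```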
Ollama Example
Ollama runs models locally on your own hardware. To use it with a voice agent:
# 1. Install and start Ollama
brew install ollama
ollama serve
# 2. Pull a fast model
ollama pull llama3.1:8b
# 3. Verify it's running (Ollama exposes an OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
Then configure your agent:
// Agent configuration for Ollama
{
"model": "llama3.1:8b",
"llm_base_url": "http://localhost:11434/v1",
"llm_api_key": "",
"temperature": 0.7,
"max_tokens": 300
}
vLLM Example
# Start vLLM with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000
# Agent config
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"llm_base_url": "http://your-gpu-server:8000/v1",
"llm_api_key": "not-needed"
}
Note: Your endpoint must support streaming (stream: true) with the OpenAI chat completions format. Non-streaming endpoints will not work. Also make sure your server is reachable from wherever the voice agent platform is running (Railway in production).
8. Cost Optimization
The dashboard includes a live cost estimator that breaks down the per-minute cost of a voice call into four components:
| Component | Service | Cost/min | Notes |
|---|---|---|---|
| STT | Deepgram Nova-2 | $0.0059 | Fixed rate, always on during the call |
| LLM | Varies by model | $0.0002 - $0.016 | Biggest variable. Depends on model choice |
| TTS | ElevenLabs | ~$0.025 | ~300 characters/min at $0.000083/char |
| Transport | Twilio Voice + Media Streams | $0.018 | $0.014 voice + $0.004 media stream |
For a typical call using Claude Haiku 4.5, the total comes to roughly $0.051/min (~$3.06/hr). With GPT-4o Mini or Gemini Flash, the LLM cost is so low that TTS and transport dominate.
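The headline number is just the sum of the four components in the table. As a quick arithmetic check using the Claude Haiku 4.5 figures:

```python
# Approximate per-minute components from the cost table above
components = {
    "stt_deepgram_nova2":  0.0059,
    "llm_claude_haiku45":  0.0021,
    "tts_elevenlabs":      0.025,
    "transport_twilio":    0.018,
}
per_min = sum(components.values())
print(f"${per_min:.3f}/min, ~${per_min * 60:.2f}/hr")  # $0.051/min, ~$3.06/hr
```

Note that the LLM is only ~4% of the total here; once you drop below Haiku-level pricing, further LLM savings barely move the per-minute cost.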
Strategies to Reduce Cost
- Choose a cheaper model. Switching from Claude Sonnet 4 ($0.0078/min) to Claude Haiku 4.5 ($0.0021/min) saves ~73% on the LLM component. For simple tasks, GPT-4o Mini or Gemini Flash cost almost nothing.
- Lower max_tokens. Output tokens are the most expensive part of LLM usage. If your agent only needs 1-2 sentence replies, set max_tokens to 150-250.
- Keep calls short. Design your agent's system prompt to be efficient. Avoid open-ended chitchat. Use clear calls-to-action to move the conversation forward.
- Use the cost estimator. In the agent editor, the cost and latency bars update live as you change models. Use this to compare options before deploying.
9. Tool Calling Compatibility
Tool calling lets the LLM invoke functions during a conversation -- booking appointments, transferring calls, looking up records, tagging calls, and more. Not all models handle tool calling equally well.
Tool Calling Support by Model
| Model | Tool Calling | Notes |
|---|---|---|
| Claude Haiku 4.5 / Sonnet 4 | Excellent | Best-in-class tool use. Reliable argument parsing, rarely hallucinates tool names. |
| GPT-4o / GPT-4.1 | Excellent | Native function calling. Very reliable with complex tool schemas. |
| GPT-4o Mini / GPT-4.1 Mini | Good | Works well for simple tools. May struggle with many tools or complex schemas. |
| Gemini 2.5 Flash / Pro | Good | Solid function calling support via OpenRouter. |
| Mistral Large | Good | Native function calling support. |
| Llama 3.3 70B / Llama 4 | Moderate | Works for basic tools. Can misformat arguments with complex schemas. |
| DeepSeek V3 | Moderate | Basic tool calling works. Not recommended for agents with many tools. |
| Mistral Small / Ministral 8B | Limited | May fail on multi-tool scenarios. Stick to no-tool or single-tool agents. |
| Self-hosted (Ollama, vLLM) | Varies | Depends on the model. Llama 3.1+ and Mistral have some support. Test thoroughly. |
The platform converts Anthropic-style tool definitions to OpenAI function-calling format automatically. When a model returns tool call results, they are streamed incrementally and parsed as they arrive.
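The shape of that conversion is straightforward: Anthropic tools carry `name`, `description`, and an `input_schema` (JSON Schema), while OpenAI function calling wraps the same fields as `function.parameters`. A minimal sketch of this kind of mapping (illustrative, not the platform's actual implementation):

```python
def anthropic_tool_to_openai(tool):
    """Map an Anthropic-style tool definition onto OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],  # both sides use JSON Schema
        },
    }

# Hypothetical booking tool in Anthropic format
book = {
    "name": "book_appointment",
    "description": "Book an appointment for the caller.",
    "input_schema": {
        "type": "object",
        "properties": {"date": {"type": "string"}},
        "required": ["date"],
    },
}
openai_tool = anthropic_tool_to_openai(book)
```

The JSON Schema body passes through unchanged; only the envelope differs between the two formats.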
10. Decision Flowchart
Use this quick-reference to pick a model based on your primary need:
Common Scenarios
- Appointment scheduler -- Claude Haiku 4.5 (temp 0.6, max_tokens 300). Needs reliable tool calling for booking.
- Customer support -- GPT-4o (temp 0.7, max_tokens 500). Good balance of intelligence and speed.
- Lead qualification -- GPT-4o or Claude Haiku 4.5 (temp 0.7, max_tokens 300). Needs tool calling for CRM integration.
- Simple FAQ / info line -- GPT-4o Mini or Gemini Flash (temp 0.5, max_tokens 200). Speed and cost matter most.
- Outbound sales -- GPT-4o or Claude Sonnet 4 (temp 0.7, max_tokens 400). Benefits from nuanced conversation ability.
- Survey / data collection -- GPT-4o Mini (temp 0.5, max_tokens 200). Simple task, optimize for cost.
- Compliance-sensitive (debt collection) -- GPT-4o or Claude Sonnet 4 (temp 0.5, max_tokens 400). Accuracy and instruction-following are critical.