Choosing the Right LLM Model

How to pick the best language model for your voice agent -- balancing speed, intelligence, cost, and tool support.

1. Overview: The LLM in a Voice Call

The Large Language Model (LLM) is the brain of every voice agent. During a live phone call, the pipeline works like this:

  1. STT (Deepgram Nova-2) transcribes the caller's speech into text.
  2. LLM interprets the transcript, decides how to respond, and optionally invokes tools (book an appointment, look up a record, transfer the call).
  3. TTS (ElevenLabs) converts the LLM's text response back into speech.

The LLM is typically both the most expensive and the most latency-sensitive component. Every millisecond the model takes to begin responding is time the caller spends waiting in silence. For natural-feeling conversations, you want the LLM's first token to arrive in under 500ms. That constraint is why model selection matters so much for voice.
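Time-to-first-token (TTFT) is the number to watch. A minimal sketch of how you might measure it, assuming `stream` is any iterator of streamed text chunks (such as the one an OpenAI-compatible streaming client yields):

```python
import time

def first_token_latency(stream):
    """Measure time-to-first-token (TTFT) over a streaming response.

    `stream` is any iterator yielding response chunks. Returns
    (ttft_seconds, full_text). Illustrative only -- a real measurement
    would wrap a live streaming call to the model.
    """
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in stream:
        if ttft is None:
            # First chunk arrived: record latency from request start.
            ttft = time.monotonic() - start
        parts.append(chunk)
    return ttft, "".join(parts)

# Simulated stream for illustration; a real call streams from the LLM.
ttft, text = first_token_latency(iter(["Hello", ", caller!"]))
```

For voice, you would compare `ttft` against the ~500ms budget described above.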

The platform default model is anthropic/claude-haiku-4.5 -- it offers the best balance of speed, capability, and cost for most voice use cases.

2. Available Providers

OpenRouter (Primary)

All LLM calls go through OpenRouter, a unified API gateway that provides access to 300+ models from every major provider. The platform uses OpenRouter's OpenAI-compatible streaming API (/v1/chat/completions), so any model on OpenRouter works automatically.

Supported model families include OpenAI (GPT), Anthropic (Claude), Google (Gemini), Meta (Llama), Mistral, DeepSeek, and xAI (Grok).

Custom Endpoints

You can point any agent at a self-hosted or third-party model by setting llm_base_url and llm_api_key in the agent configuration. The endpoint must expose an OpenAI-compatible /chat/completions route with streaming support. This works with Ollama, vLLM, LocalAI, Together AI, and any other OpenAI-compatible server. See Section 7 for details.

3. Model Discovery (300+ Models)

The dashboard model dropdown shows every model available through OpenRouter -- currently 300+. To help you navigate this, models are split into two tiers:

Proven Tier

Models that have been tested specifically for real-time voice conversations. Each model in this tier has a composite score derived from three factors: response latency, tool-calling reliability, and streaming consistency.

Models are ranked highest-to-lowest by composite score so the best all-around options appear at the top. You also see the estimated per-minute voice cost displayed next to each model name so you can make cost-aware decisions at a glance.
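As an illustration only, a composite ranking of this kind might blend normalized factors with fixed weights. The weights, normalization, and exact factor definitions below are hypothetical, not the platform's actual formula:

```python
def composite_score(latency_ms, tool_reliability, streaming_consistency,
                    weights=(0.4, 0.4, 0.2)):
    """Blend three normalized factors into one score (higher is better).

    Hypothetical formula: latency is mapped to [0, 1] (0ms -> 1.0,
    1000ms -> 0.0); the other two factors are already 0-1 ratings.
    """
    speed = max(0.0, 1.0 - latency_ms / 1000.0)
    w_speed, w_tools, w_stream = weights
    return (w_speed * speed
            + w_tools * tool_reliability
            + w_stream * streaming_consistency)

# Illustrative ratings only -- not measured values.
models = {
    "claude-haiku-4.5": composite_score(400, 0.95, 0.95),
    "claude-sonnet-4": composite_score(800, 0.95, 0.95),
}
ranked = sorted(models, key=models.get, reverse=True)
```

With these illustrative inputs, the faster model ranks first, matching the dropdown's highest-to-lowest ordering.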

Untested Tier

Every other model on OpenRouter. These appear below the Proven models in the dropdown, clearly labeled "Untested -- use at your own risk." They may work fine, but they haven't been validated for voice latency or tool calling.

Untested models may have high latency, poor tool calling support, or inconsistent streaming behavior. Always make a test call before deploying an untested model to production.

You can search or scroll the dropdown to find any specific model. The search box filters both tiers simultaneously.

4. Recommended Voice Models

The dashboard curates a set of models specifically suited for real-time voice conversations. The table below lists the top picks with estimated costs and latency. Cost per minute assumes roughly 600 input tokens and 400 output tokens per minute (about 4 conversational turns).

| Model | Provider | $/min (LLM only) | Latency (est.) | Best For |
|---|---|---|---|---|
| Claude Haiku 4.5 (DEFAULT) | Anthropic | $0.0021 | ~400ms | Best all-around: fast, cheap, smart enough for most agents |
| GPT-4o Mini (FAST) | OpenAI | $0.0003 | ~300ms | Ultra-cheap, fast. Great for simple FAQ and info collection |
| GPT-4.1 Mini | OpenAI | $0.0009 | ~300ms | Newer GPT-4o Mini replacement, slightly smarter |
| Gemini 2.0 Flash (CHEAPEST) | Google | $0.0002 | ~250ms | Lowest cost, ultra-fast. Good for high-volume simple tasks |
| Gemini 2.5 Flash | Google | $0.0003 | ~250ms | Slightly smarter than 2.0 Flash at similar cost |
| GPT-4o | OpenAI | $0.0055 | ~600ms | Strong general-purpose. Customer support, sales, complex flows |
| Claude Sonnet 4 (SMART) | Anthropic | $0.0078 | ~800ms | High intelligence. Complex reasoning, nuanced conversations |
| GPT-4.1 | OpenAI | $0.0044 | ~600ms | Latest OpenAI flagship. Strong instruction following |
| DeepSeek V3 | DeepSeek | $0.0005 | ~500ms | Very cheap, decent quality. Budget-friendly option |
| Llama 3.3 70B | Meta | $0.0002 | ~300ms | Open-source, very cheap. Moderate tool calling support |
| Mistral Small 3.1 | Mistral | $0.0004 | ~300ms | Small, fast, cheap. Good for European language support |
| Grok 3 Mini | xAI | $0.0004 | ~400ms | Compact reasoning model from xAI |

The cost above is only the LLM component. A full voice call also includes STT ($0.0059/min), TTS (~$0.025/min), and transport ($0.018/min). See Cost Optimization for the full breakdown.
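The $/min figures in the table follow directly from a model's per-token pricing and the 600-in/400-out assumption. A minimal sketch of that arithmetic, using hypothetical prices of $1/M input and $3/M output tokens (real prices vary by model):

```python
def llm_cost_per_minute(input_price_per_mtok, output_price_per_mtok,
                        input_tokens=600, output_tokens=400):
    """Estimate LLM $/min for a voice call.

    Assumes ~600 input and ~400 output tokens per minute, as in the
    table above. Prices are dollars per million tokens.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Hypothetical pricing, for illustration only:
cost = llm_cost_per_minute(1.0, 3.0)  # $/min at $1/M in, $3/M out
```

Plug in a model's actual OpenRouter pricing to reproduce or update the table's estimates.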

5. Temperature

Temperature controls how deterministic or creative the model's responses are. It is a value between 0 and 2 (though anything above 1.0 is rarely useful for voice).

Avoid temperatures above 1.0 for voice agents. High temperature increases the chance of rambling, hallucinated responses, and broken tool calls -- all of which create a poor caller experience.

The platform default is 0.7, which works well for most conversational agents. You can set temperature per agent in the agent editor.

// Example: Setting temperature in agent config
{
  "model": "anthropic/claude-haiku-4.5",
  "temperature": 0.5,  // Lower for structured tasks
  "max_tokens": 300
}

6. Max Tokens

max_tokens controls the maximum length of the model's response per turn. It does not affect the input (what the model reads), only the output (what it generates). The platform default is 1024 tokens (~750 words).

Guidelines

Voice conversations are naturally short-turn. A 300-token response is already about 45 seconds of speech. If your agent is monologuing, lower max_tokens and instruct the model in the system prompt to keep responses to 1-2 sentences.

Impact of max_tokens:

  1. Cost -- output tokens are the most expensive part of LLM usage, so a lower cap directly reduces spend.
  2. Latency -- shorter responses finish sooner and keep conversational turns snappy.
  3. Truncation -- set it too low and the model can be cut off mid-sentence.

7. Custom LLM Endpoints

You can use any OpenAI-compatible LLM server by setting two fields in the agent configuration: llm_base_url (the base URL of your server) and llm_api_key (an API key, if your server requires one).

When llm_base_url is set, the platform sends requests to {llm_base_url}/chat/completions instead of OpenRouter. The request format is identical -- OpenAI-compatible JSON with streaming (stream: true).
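The routing rule can be sketched as follows (illustrative only; the platform's internal code may differ):

```python
def resolve_chat_url(llm_base_url=None):
    """Pick the chat-completions endpoint for a request.

    Mirrors the routing described above: a custom llm_base_url takes
    precedence over the default OpenRouter gateway.
    """
    base = llm_base_url or "https://openrouter.ai/api/v1"
    return base.rstrip("/") + "/chat/completions"
```

For example, an agent configured with an Ollama base URL resolves to `http://localhost:11434/v1/chat/completions`, while an agent with no custom endpoint goes to OpenRouter.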

Ollama Example

Ollama runs models locally on your own hardware. To use it with a voice agent:

# 1. Install and start Ollama
brew install ollama
ollama serve

# 2. Pull a fast model
ollama pull llama3.1:8b

# 3. Verify it's running (Ollama exposes an OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

Then configure your agent:

// Agent configuration for Ollama
{
  "model": "llama3.1:8b",
  "llm_base_url": "http://localhost:11434/v1",
  "llm_api_key": "",
  "temperature": 0.7,
  "max_tokens": 300
}

vLLM Example

# Start vLLM with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# Agent config
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "llm_base_url": "http://your-gpu-server:8000/v1",
  "llm_api_key": "not-needed"
}

Custom endpoints must support SSE streaming (stream: true) with the OpenAI chat completions format. Non-streaming endpoints will not work. Also make sure your server is reachable from wherever the voice agent platform is running (Railway in production).

8. Cost Optimization

The dashboard includes a live cost estimator that breaks down the per-minute cost of a voice call into four components:

| Component | Service | Cost/min | Notes |
|---|---|---|---|
| STT | Deepgram Nova-2 | $0.0059 | Fixed rate, always on during the call |
| LLM | Varies by model | $0.0002 - $0.016 | Biggest variable. Depends on model choice |
| TTS | ElevenLabs | ~$0.025 | ~300 characters/min at $0.000083/char |
| Transport | Twilio Voice + Media Streams | $0.018 | $0.014 voice + $0.004 media stream |

For a typical call using Claude Haiku 4.5, the total comes to roughly $0.051/min (~$3.06/hr). With GPT-4o Mini or Gemini Flash, the LLM cost is so low that TTS and transport dominate.
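A quick sketch of that arithmetic, summing the component rates from the table above:

```python
def total_cost_per_minute(llm_cost, stt=0.0059, tts=0.025, transport=0.018):
    """Sum the four per-minute components of a voice call.

    Default rates are the STT, TTS, and transport figures from the
    table above; pass the LLM rate for your chosen model.
    """
    return llm_cost + stt + tts + transport

per_min = total_cost_per_minute(0.0021)  # Claude Haiku 4.5 LLM rate
per_hour = per_min * 60
```

Swapping in a cheaper LLM rate (e.g. $0.0003/min for GPT-4o Mini) shows why TTS and transport quickly become the dominant costs.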

Strategies to Reduce Cost

  1. Choose a cheaper model. Switching from Claude Sonnet 4 ($0.0078/min) to Claude Haiku 4.5 ($0.0021/min) saves ~73% on the LLM component. For simple tasks, GPT-4o Mini or Gemini Flash cost almost nothing.
  2. Lower max_tokens. Output tokens are the most expensive part of LLM usage. If your agent only needs 1-2 sentence replies, set max_tokens to 150-250.
  3. Keep calls short. Design your agent's system prompt to be efficient. Avoid open-ended chitchat. Use clear calls-to-action to move the conversation forward.
  4. Use the cost estimator. In the agent editor, the cost and latency bars update live as you change models. Use this to compare options before deploying.

The LLM is often NOT the most expensive component. TTS (ElevenLabs) at ~$0.025/min frequently costs more than the LLM. If total cost is a concern, optimizing call duration and response length matters more than switching from a $0.002 model to a $0.0003 model.

9. Tool Calling Compatibility

Tool calling lets the LLM invoke functions during a conversation -- booking appointments, transferring calls, looking up records, tagging calls, and more. Not all models handle tool calling equally well.

Tool Calling Support by Model

| Model | Tool Calling | Notes |
|---|---|---|
| Claude Haiku 4.5 / Sonnet 4 | Excellent | Best-in-class tool use. Reliable argument parsing, rarely hallucinates tool names. |
| GPT-4o / GPT-4.1 | Excellent | Native function calling. Very reliable with complex tool schemas. |
| GPT-4o Mini / GPT-4.1 Mini | Good | Works well for simple tools. May struggle with many tools or complex schemas. |
| Gemini 2.5 Flash / Pro | Good | Solid function calling support via OpenRouter. |
| Mistral Large | Good | Native function calling support. |
| Llama 3.3 70B / Llama 4 | Moderate | Works for basic tools. Can misformat arguments with complex schemas. |
| DeepSeek V3 | Moderate | Basic tool calling works. Not recommended for agents with many tools. |
| Mistral Small / Ministral 8B | Limited | May fail on multi-tool scenarios. Stick to no-tool or single-tool agents. |
| Self-hosted (Ollama, vLLM) | Varies | Depends on the model. Llama 3.1+ and Mistral have some support. Test thoroughly. |

The platform converts Anthropic-style tool definitions to OpenAI function-calling format automatically. When a model returns tool call results, they are streamed incrementally and parsed as they arrive.
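A minimal sketch of that conversion, assuming standard Anthropic-style tool definitions on the input side (the `book_appointment` tool and its schema below are hypothetical examples, and the platform's actual converter may handle more edge cases):

```python
def anthropic_tool_to_openai(tool):
    """Convert an Anthropic-style tool definition (name / description /
    input_schema) to OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }

# Hypothetical tool definition, for illustration only:
book = {
    "name": "book_appointment",
    "description": "Book an appointment for the caller",
    "input_schema": {
        "type": "object",
        "properties": {"time": {"type": "string"}},
        "required": ["time"],
    },
}
openai_tool = anthropic_tool_to_openai(book)
```

The converted definition is what gets sent in the `tools` array of the OpenAI-compatible request.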

If your agent uses built-in tools (call transfer, appointment booking, call tagging, webhooks), choose a model with "Excellent" or "Good" tool calling support. A model that cannot reliably call tools will leave callers stuck.

10. Decision Flowchart

Use this quick-reference to pick a model based on your primary need:

What is your top priority?

  Lowest cost ---> Gemini 2.0 Flash ($0.0002/min) or GPT-4o Mini ($0.0003/min)
  Fastest response ---> Gemini 2.5 Flash (~250ms) or GPT-4o Mini (~300ms)
  Best all-around ---> Claude Haiku 4.5 (fast, cheap, great tool use) [DEFAULT]
  Smartest / complex reasoning ---> Claude Sonnet 4 or GPT-4.1
  Many tools / complex schemas ---> Claude Haiku 4.5 or GPT-4o (best tool calling)
  Budget with decent quality ---> DeepSeek V3 ($0.0005/min) or Llama 3.3 70B
  Self-hosted / private data ---> Ollama + Llama 3.1 8B via custom endpoint
  European language support ---> Mistral Small 3.1 or Mistral Large
  High-volume simple calls ---> Gemini 2.0 Flash or GPT-4.1 Nano ($0.0002/min)

Common Scenarios