LLM Routing in Production: OpenRouter, LiteLLM, and When Provider Failover Pays Off
Single-provider AI dependencies are a reliability risk. Routing layers like LiteLLM and OpenRouter let you fall back across providers, cap costs, and try smaller models first. Here is the architecture and when it actually matters.
Anurag Verma
Production AI applications that call a single LLM provider have a single point of failure. When Anthropic had a partial outage in mid-2025, every application hardcoded to the Claude API degraded at the same time. When OpenAI’s rate limits hit hard in early 2026, teams that had invested in model routing shrugged, while teams that hadn’t were left patching in fallback logic at 2 a.m.
The analogy to databases is useful: you wouldn’t run production software with a single database instance and no failover. The same reasoning applies to LLM providers.
LLM routing is the layer that sits between your application code and one or more AI providers. It handles provider selection, failover, rate limit management, cost capping, and sometimes model selection based on request characteristics. Two tools dominate this space in 2026: LiteLLM (open-source, used as a library or as a self-hosted proxy) and OpenRouter (a hosted cloud service).
LiteLLM
LiteLLM started as a Python library that provides a unified API over 100+ models from different providers. You call it with OpenAI-compatible syntax; it handles translating the request to whatever the target provider needs.
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

# Call Claude — same API
response = completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
)

# Call Gemini — same API
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
The library also ships a proxy server — a drop-in replacement for the OpenAI API that your application talks to, while LiteLLM handles the multi-provider routing behind it.
# Start the LiteLLM proxy with a config file that lists the models to serve
litellm --config config.yaml --port 8000
Your application sends requests to http://localhost:8000 with OpenAI-style API calls. The proxy handles authentication, routing, retries, and logging. Applications that already use the OpenAI SDK need only point the client’s base URL at the proxy, either in code as shown here or through the SDK’s OPENAI_BASE_URL environment variable:
import os
from openai import OpenAI

# Before: direct to OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# After: through LiteLLM proxy
client = OpenAI(
    api_key="sk-anything",  # LiteLLM handles auth to actual providers
    base_url="http://localhost:8000",
)
Fallback Configuration
Fallbacks are where LiteLLM starts earning its place. A fallback configuration looks like this in YAML (the proxy’s config file):
model_list:
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gemini-fallback
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GOOGLE_API_KEY

router_settings:
  fallbacks:
    - {"claude-primary": ["gpt-fallback", "gemini-fallback"]}
  num_retries: 2
  retry_after: 5      # seconds between retries
  allowed_fails: 3    # failures before marking a model as unhealthy
  cooldown_time: 60   # seconds before retrying an unhealthy model
When Claude returns a 503 or rate limit error, LiteLLM automatically falls back to GPT-4o. If that also fails, it tries Gemini. The failover is transparent to the calling application.
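To make the failover order concrete, here is a minimal sketch of the pattern in plain Python. It is not LiteLLM’s implementation: `ProviderError`, `call_with_fallbacks`, and the two stand-in provider functions are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Stand-in for a retryable provider failure such as a 503 or a rate limit."""


def call_with_fallbacks(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the chain.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider calls.
def flaky_primary(prompt: str) -> str:
    raise ProviderError("503 Service Unavailable")


def healthy_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallbacks(
    "Hello",
    [("claude-primary", flaky_primary), ("gpt-fallback", healthy_fallback)],
)
print(used, reply)
```

The caller only sees the final reply plus the name of the provider that produced it, which is the value worth logging.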
For Python direct usage (without the proxy):
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Summarize this article."}],
    fallbacks=["gpt-4o", "gemini/gemini-2.0-flash"],
    num_retries=2,
)
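The allowed_fails and cooldown_time settings in the proxy config describe a cooldown: after enough failures, a model is skipped for a while instead of being retried on every request. The sketch below illustrates that idea under simplifying assumptions. `CooldownTracker` is a hypothetical class, and its exact threshold semantics (cooldown starts on the Nth failure) may differ from LiteLLM’s.

```python
import time
from typing import Callable


class CooldownTracker:
    """Skips a model for `cooldown_time` seconds after `allowed_fails` failures."""

    def __init__(
        self,
        allowed_fails: int = 3,
        cooldown_time: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.fail_counts: dict[str, int] = {}
        self.cooldown_until: dict[str, float] = {}

    def record_failure(self, model: str) -> None:
        self.fail_counts[model] = self.fail_counts.get(model, 0) + 1
        if self.fail_counts[model] >= self.allowed_fails:
            # Threshold reached: start the cooldown and reset the counter.
            self.cooldown_until[model] = self.clock() + self.cooldown_time
            self.fail_counts[model] = 0

    def record_success(self, model: str) -> None:
        self.fail_counts[model] = 0

    def is_available(self, model: str) -> bool:
        return self.clock() >= self.cooldown_until.get(model, float("-inf"))


# Demo with a fake clock (a one-element list) instead of real time.
fake_now = [0.0]
tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0, clock=lambda: fake_now[0])
for _ in range(3):
    tracker.record_failure("claude-primary")
print(tracker.is_available("claude-primary"))  # in cooldown right after the third failure
fake_now[0] = 60.0
print(tracker.is_available("claude-primary"))  # available again once the cooldown has elapsed
```

A router would check `is_available` before each attempt and go straight to the next model in the chain while the primary is cooling down.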
Cost-Based Routing
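A common pattern: send traffic to a cheaper, smaller model by default and fall back to a more expensive one only when needed. With the proxy’s fallback settings, “needed” means the cheap model errored or was rate-limited; escalating because an answer isn’t good enough requires a quality check in your application.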
model_list:
  - model_name: smart
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smarter
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - {"smart": ["smarter"]}
Your application calls smart for all requests. If Haiku rate-limits or errors, it falls back to Sonnet. The team pays Haiku prices for the bulk of traffic and Sonnet prices only for overflow.
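The savings depend on how much traffic overflows to the expensive model. A quick back-of-the-envelope sketch follows; the prices are placeholders chosen for easy arithmetic, not real provider prices, and `blended_price` is a hypothetical helper.

```python
def blended_price(cheap_price: float, expensive_price: float, overflow_fraction: float) -> float:
    """Average price per million tokens when `overflow_fraction` of traffic hits the fallback."""
    if not 0.0 <= overflow_fraction <= 1.0:
        raise ValueError("overflow_fraction must be between 0 and 1")
    return (1.0 - overflow_fraction) * cheap_price + overflow_fraction * expensive_price


# Placeholder prices per million tokens (assumptions for illustration only).
CHEAP = 1.00
EXPENSIVE = 5.00

for overflow in (0.0, 0.05, 0.25):
    price = blended_price(CHEAP, EXPENSIVE, overflow)
    print(f"{overflow:.0%} overflow -> ${price:.2f} per million tokens")
```

With these placeholder numbers, 5% overflow costs $1.20 per million tokens and 25% overflow costs $2.00, which is why the fallback rate is worth watching as a cost metric too.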
For explicit routing based on request type, the routing logic lives in your application:
def get_model_for_request(request_type: str) -> str:
    if request_type in ("code_review", "architecture_analysis"):
        return "anthropic/claude-sonnet-4-6"  # Complex tasks
    elif request_type in ("summarize", "classify", "extract"):
        return "anthropic/claude-haiku-4-5"  # Simple tasks
    else:
        return "openai/gpt-4o-mini"  # Default cheap option
This kind of semantic routing is application-specific. LiteLLM handles the infrastructure concerns; routing by request type is business logic that belongs in your application.
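Escalating when the cheap model’s answer isn’t good enough is another piece of application logic, because error-based fallbacks never see answer quality. One way to sketch it, assuming the task expects a JSON object back: `complete_with_escalation`, `is_valid_json_object`, and the two stand-in model functions are hypothetical, and a real check would be specific to your task.

```python
import json
from typing import Callable


def is_valid_json_object(text: str) -> bool:
    """Example quality check: the reply must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def complete_with_escalation(
    prompt: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    is_good_enough: Callable[[str], bool] = is_valid_json_object,
) -> tuple[str, str]:
    """Try the cheap model first; call the strong model only if the check fails."""
    draft = cheap(prompt)
    if is_good_enough(draft):
        return "cheap", draft
    return "strong", strong(prompt)


# Hypothetical stand-ins for real model calls.
def cheap_model(prompt: str) -> str:
    return "Sure! The label is spam."  # not valid JSON, so the check fails


def strong_model(prompt: str) -> str:
    return '{"label": "spam"}'


tier, reply = complete_with_escalation("Classify this email as JSON.", cheap_model, strong_model)
print(tier, reply)
```

Every escalation pays for both calls, so this only saves money when the cheap model passes the check most of the time.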
OpenRouter
OpenRouter is a hosted API service that provides a single endpoint for 200+ models from most major providers. You send an OpenAI-compatible request to https://openrouter.ai/api/v1/chat/completions with a model name, and OpenRouter routes it to the provider.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # OpenRouter's model identifier format
    messages=[{"role": "user", "content": "Hello"}],
)
OpenRouter’s advantages over LiteLLM:
- No infrastructure to run: no proxy server to deploy or maintain
- Single billing relationship: pay OpenRouter once, access all providers
- Automatic model fallback: configure a fallback list in the request body or in the OpenRouter dashboard
- Model comparison: try the same prompt across multiple models via a shared API
The trade-offs:
- Latency: requests go through OpenRouter’s servers before reaching the provider
- Vendor dependency: OpenRouter itself is a new dependency, with its own reliability characteristics
- Less control: less flexibility than self-hosted LiteLLM for complex routing logic
- Markup: OpenRouter charges above provider list prices (typically a small percentage)
For side projects, early-stage products, or situations where you want to compare models quickly, OpenRouter is the faster path. For production applications where latency, cost, and control matter, LiteLLM self-hosted is usually the better fit.
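The automatic fallback listed among OpenRouter’s advantages can be requested per call. This is a minimal sketch, assuming OpenRouter’s `models` request-body field (a list of alternates tried in order) and the OpenAI SDK’s `extra_body` passthrough; check both against the current documentation. `build_fallback_request` is a hypothetical helper, and the model identifiers are the ones used elsewhere in this article.

```python
from typing import Any


def build_fallback_request(
    primary: str,
    fallbacks: list[str],
    messages: list[dict[str, str]],
) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": primary,
        "messages": messages,
        # Fields the OpenAI SDK does not know about go into the JSON body via extra_body.
        "extra_body": {"models": list(fallbacks)},
    }


kwargs = build_fallback_request(
    primary="anthropic/claude-sonnet-4-6",
    fallbacks=["openai/gpt-4o"],
    messages=[{"role": "user", "content": "Hello"}],
)
print(kwargs["extra_body"])

# With an OpenRouter-configured client (base_url="https://openrouter.ai/api/v1"):
#   response = client.chat.completions.create(**kwargs)
#   print(response.model)  # reports which model actually served the request
```

Keeping the request construction in a plain function makes the fallback list easy to unit-test and to change in one place.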
What to Route and What Not To
Not every LLM call needs a routing layer. The complexity cost of routing is real.
Worth routing when:
- Your application’s availability depends on the LLM being available (user-facing features, not internal tooling)
- You’re spending enough on LLM costs that model selection materially affects your budget
- You’ve already experienced provider outages affecting users
Not worth routing when:
- Development or low-traffic side projects
- Batch processing jobs that can queue and retry
- Single-model workflows where fallback would produce inconsistent output (fine-tuned models, specific capabilities only one provider has)
The routing layer adds two things worth the overhead at scale: reliability through fallback, and cost control through model selection. If neither of these is an active problem for your application today, invest the time elsewhere.
Monitoring
A routing layer only helps if you know when it’s activating. Both LiteLLM and OpenRouter provide logging hooks:
# LiteLLM: success/failure callbacks
import litellm

def log_success(kwargs, response, start_time, end_time):
    print(f"Model used: {kwargs['model']}")
    print(f"Tokens: {response.usage.total_tokens}")
    print(f"Cost: ${litellm.completion_cost(completion_response=response):.4f}")

def log_failure(kwargs, response, start_time, end_time):
    # The exception is expected in kwargs["exception"]; fall back to the second argument if it is absent
    exception = kwargs.get("exception") or response
    print(f"Failed model: {kwargs['model']}")
    print(f"Exception: {exception}")
    # Prints None unless something has set a 'fallback_model' key in the request metadata
    print(f"Fallback triggered: {(kwargs.get('metadata') or {}).get('fallback_model')}")

litellm.success_callback = [log_success]
litellm.failure_callback = [log_failure]
Track fallback rate over time. A spike in fallbacks is a leading indicator of provider issues before they become customer-visible. A consistent high fallback rate suggests your routing order is wrong — your primary model is unreliable for your traffic pattern, and you should reconsider the priority.
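One simple way to track this is a rolling window over recent requests. The sketch below is illustrative: `FallbackRateMonitor`, the window size, and the alert threshold are hypothetical choices, and in practice you would feed `record` from your logging callback and send the rate to your metrics system.

```python
from collections import deque


class FallbackRateMonitor:
    """Tracks the share of recent requests that were served by a fallback model."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.2) -> None:
        self.events: deque[bool] = deque(maxlen=window)  # oldest entries drop off automatically
        self.alert_threshold = alert_threshold

    def record(self, used_fallback: bool) -> None:
        self.events.append(used_fallback)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self) -> bool:
        return self.rate > self.alert_threshold


monitor = FallbackRateMonitor(window=10, alert_threshold=0.2)
for used_fallback in [False] * 8 + [True] * 2:
    monitor.record(used_fallback)
print(monitor.rate, monitor.should_alert())

monitor.record(True)  # one more fallback pushes the oldest primary success out of the window
print(monitor.rate, monitor.should_alert())
```

A short window catches spikes quickly, while a longer one is better for spotting the consistently high rate that signals a wrong routing order.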
The Architecture Decision
The mental model that helps: treat your LLM provider(s) as infrastructure dependencies with availability characteristics, not as magic APIs that are always up. The routing layer is the same pattern as connection pooling, circuit breakers, and database read replicas — it makes your application resilient to external dependencies that are out of your control.
The question isn’t whether you need a routing layer. It’s when the cost of building one is lower than the cost of not having one. For most production AI features serving real users, that threshold hits earlier than you’d expect.