

LLM Routing in Production: OpenRouter, LiteLLM, and When Provider Failover Pays Off

Single-provider AI dependencies are a reliability risk. Routing layers like LiteLLM and OpenRouter let you fall back across providers, cap costs, and try smaller models first. Here is the architecture and when it actually matters.

Anurag Verma

Production AI applications that call a single LLM provider have a single point of failure. When Anthropic had a partial outage in mid-2025, every application hardcoded to the Claude API degraded at the same time. When OpenAI’s rate limits hit hard in early 2026, teams that had invested in model routing shrugged while teams that hadn’t scrambled to patch fallback logic at 2am.

The analogy to databases is useful: you wouldn’t run production software with a single database instance and no failover. The same reasoning applies to LLM providers.

LLM routing is the layer that sits between your application code and one or more AI providers. It handles provider selection, failover, rate limit management, cost capping, and sometimes model selection based on request characteristics. Two tools dominate this space in 2026: LiteLLM (open-source, self-hosted or proxy) and OpenRouter (cloud service).

LiteLLM

LiteLLM started as a Python library that provides a unified API over 100+ models from different providers. You call it with OpenAI-compatible syntax; it handles translating the request to whatever the target provider needs.

from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

# Call Claude — same API
response = completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}]
)

# Call Gemini — same API
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello"}]
)

The library also ships a proxy server — a drop-in replacement for the OpenAI API that your application talks to, while LiteLLM handles the multi-provider routing behind it.

# Start the LiteLLM proxy from a config file that lists the models
# (config.yaml is a placeholder name; --model accepts a single model only)
litellm --config config.yaml --port 8000

Your application sends requests to http://localhost:8000 with OpenAI-style API calls. The proxy handles authentication, routing, retries, and logging. Applications that already use the OpenAI SDK only need to point the client’s base URL at the proxy, either in code as shown here or through the SDK’s OPENAI_BASE_URL environment variable:

import os

from openai import OpenAI

# Before: direct to OpenAI
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# After: through LiteLLM proxy
client = OpenAI(
    api_key="sk-anything",  # LiteLLM handles auth to actual providers
    base_url="http://localhost:8000"
)

Fallback Configuration

Fallbacks are where LiteLLM starts earning its place. A fallback configuration looks like this in YAML (the proxy’s config file):

model_list:
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gpt-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gemini-fallback
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GOOGLE_API_KEY

router_settings:
  fallbacks:
    - {"claude-primary": ["gpt-fallback", "gemini-fallback"]}
  num_retries: 2
  retry_after: 5  # seconds between retries
  allowed_fails: 3  # failures before marking a model as unhealthy
  cooldown_time: 60  # seconds before retrying an unhealthy model

When Claude returns a 503 or a rate-limit error (HTTP 429) and the configured retries are exhausted, LiteLLM falls back to GPT-4o. If that also fails, it tries Gemini. The failover is transparent to the calling application, which still receives a normal OpenAI-style response.
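The control flow behind an ordered fallback chain is small enough to sketch in plain Python. This is an illustration of the idea, not LiteLLM's implementation: `ProviderError`, `call_with_fallbacks`, and the two stub providers are hypothetical names, and the stubs stand in for real API calls.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Stand-in for a retryable provider failure such as a 503 or 429."""


def call_with_fallbacks(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    num_retries: int = 2,
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, response)."""
    errors: list[str] = []
    for name, call in providers:
        # One initial attempt plus num_retries retries before moving on.
        for attempt in range(1 + num_retries):
            try:
                return name, call(prompt)
            except ProviderError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Hypothetical primary provider that is currently down."""
    raise ProviderError("503 Service Unavailable")


def healthy_fallback(prompt: str) -> str:
    """Hypothetical fallback provider that answers normally."""
    return f"echo: {prompt}"


used, text = call_with_fallbacks(
    "Hello",
    [("claude-primary", flaky_primary), ("gpt-fallback", healthy_fallback)],
)
print(used, text)
```

The caller only sees which provider answered and what it said, which is what makes the failover invisible to the rest of the application.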

For Python direct usage (without the proxy):

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Summarize this article."}],
    fallbacks=["gpt-4o", "gemini/gemini-2.0-flash"],
    num_retries=2,
)
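The allowed_fails and cooldown_time settings in the proxy config behave like a simple circuit breaker: a model that keeps failing is taken out of rotation for a while. The sketch below shows one way such bookkeeping could work. `CooldownTracker` is a hypothetical class, the "more than allowed_fails failures" threshold is an assumption of this sketch, and the injectable clock exists only so the example runs without waiting.

```python
import time
from typing import Callable


class CooldownTracker:
    """Tracks failures per model and sidelines models that fail too often."""

    def __init__(
        self,
        allowed_fails: int = 3,
        cooldown_time: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock
        self.fail_counts: dict[str, int] = {}
        self.cooldown_until: dict[str, float] = {}

    def record_failure(self, model: str) -> None:
        self.fail_counts[model] = self.fail_counts.get(model, 0) + 1
        # Assumption: cooldown starts once failures exceed allowed_fails.
        if self.fail_counts[model] > self.allowed_fails:
            self.cooldown_until[model] = self.clock() + self.cooldown_time
            self.fail_counts[model] = 0

    def record_success(self, model: str) -> None:
        self.fail_counts[model] = 0

    def is_available(self, model: str) -> bool:
        return self.clock() >= self.cooldown_until.get(model, 0.0)


# Fake clock so the example is deterministic.
now = [0.0]
tracker = CooldownTracker(allowed_fails=3, cooldown_time=60, clock=lambda: now[0])
for _ in range(4):
    tracker.record_failure("claude-primary")
available_during_cooldown = tracker.is_available("claude-primary")
now[0] = 61.0
available_after_cooldown = tracker.is_available("claude-primary")
print(available_during_cooldown, available_after_cooldown)
```

A router would consult `is_available` before each attempt and skip straight to the next model in the fallback list while the primary is cooling down.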

Cost-Based Routing

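A common pattern: send traffic to a cheaper, smaller model by default and fall back to a more expensive model only when the cheap one is rate-limited or returns an error. Error-based fallback does not detect a weak answer; escalating on quality needs a check in your application.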

model_list:
  - model_name: smart
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: smarter
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - {"smart": ["smarter"]}

Your application calls smart for all requests. If Haiku rate-limits or errors, it falls back to Sonnet. The team pays Haiku prices for the bulk of traffic and Sonnet prices only for overflow.
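The budget effect is a weighted average. The sketch below works it through with placeholder numbers: the per-request costs and the 5% overflow share are made up for illustration and are not real Haiku or Sonnet prices. It also assumes a request that fails on the cheap model costs nothing.

```python
def blended_cost_per_request(
    cheap_cost: float, expensive_cost: float, overflow_fraction: float
) -> float:
    """Average cost per request when a fraction of traffic overflows."""
    if not 0.0 <= overflow_fraction <= 1.0:
        raise ValueError("overflow_fraction must be between 0 and 1")
    return (1 - overflow_fraction) * cheap_cost + overflow_fraction * expensive_cost


# Placeholder per-request costs in dollars (hypothetical, not provider prices).
cheap = 0.001
expensive = 0.004

blended = blended_cost_per_request(cheap, expensive, overflow_fraction=0.05)
savings_vs_expensive_only = 1 - blended / expensive
print(f"blended: ${blended:.5f} per request")
print(f"savings vs. expensive-only: {savings_vs_expensive_only:.0%}")
```

Plugging in your own measured costs and overflow rate shows quickly whether the cheap-first setup is worth the extra configuration.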

For explicit routing based on request type, the routing logic lives in your application:

def get_model_for_request(request_type: str) -> str:
    if request_type in ("code_review", "architecture_analysis"):
        return "anthropic/claude-sonnet-4-6"  # Complex tasks
    elif request_type in ("summarize", "classify", "extract"):
        return "anthropic/claude-haiku-4-5"   # Simple tasks
    else:
        return "openai/gpt-4o-mini"            # Default cheap option

This kind of task-based routing is application-specific. LiteLLM handles the infrastructure concerns; routing by request type is business logic that belongs in your application.
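Escalating on answer quality, rather than on errors, is also application logic. One possible shape is a cascade: ask the cheap model first, validate the answer, and call the stronger model only when validation fails. In this sketch `cascade`, `is_valid_json`, and both stub models are hypothetical; a real check would be whatever "good enough" means for your task.

```python
import json
from typing import Callable


def is_valid_json(text: str) -> bool:
    """Example quality check: the answer must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Return (tier, answer), escalating only if the cheap answer fails the check."""
    draft = cheap_model(prompt)
    if is_good_enough(draft):
        return "cheap", draft
    return "strong", strong_model(prompt)


def cheap_stub(prompt: str) -> str:
    """Hypothetical cheap model that ignores the requested format."""
    return "The message looks like spam."


def strong_stub(prompt: str) -> str:
    """Hypothetical strong model that follows the requested format."""
    return '{"label": "spam"}'


tier, answer = cascade(
    "Classify this message. Reply as JSON.", cheap_stub, strong_stub, is_valid_json
)
print(tier, answer)
```

Note that an escalated request pays for both calls, so the pattern only saves money when the cheap model passes the check most of the time.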

OpenRouter

OpenRouter is a hosted API service that provides a single endpoint for 200+ models from most major providers. You send an OpenAI-compatible request to https://openrouter.ai/api/v1/chat/completions with a model name, and OpenRouter routes it to the provider.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ['OPENROUTER_API_KEY'],
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # OpenRouter's model identifier format
    messages=[{"role": "user", "content": "Hello"}],
)

OpenRouter’s advantages over LiteLLM:

  • No infrastructure to run: no proxy server to deploy or maintain
  • Single billing relationship: pay OpenRouter once, access all providers
  • Automatic model fallback: configure a list of fallback models in the request body or the OpenRouter dashboard
  • Model comparison: try the same prompt across multiple models via a shared API

The trade-offs:

  • Latency: requests go through OpenRouter’s servers before reaching the provider
  • Vendor dependency: OpenRouter itself is a new dependency, with its own reliability characteristics
  • Less control: less flexibility than self-hosted LiteLLM for complex routing logic
  • Markup: OpenRouter charges above provider list prices (typically a small percentage)

For side projects, early-stage products, or situations where you want to compare models quickly, OpenRouter is the faster path. For production applications where latency, cost, and control matter, LiteLLM self-hosted is usually the better fit.

What to Route and What Not To

Not every LLM call needs a routing layer. The complexity cost of routing is real.

Worth routing when:

  • Your application’s availability depends on the LLM being available (user-facing features, not internal tooling)
  • You’re spending enough on LLM costs that model selection materially affects your budget
  • You’ve already experienced provider outages affecting users

Not worth routing when:

  • Development or low-traffic side projects
  • Batch processing jobs that can queue and retry
  • Single-model workflows where fallback would produce inconsistent output (fine-tuned models, specific capabilities only one provider has)

The routing layer adds two things worth the overhead at scale: reliability through fallback, and cost control through model selection. If neither of these is an active problem for your application today, invest the time elsewhere.

Monitoring

A routing layer only helps if you know when it’s activating. Both LiteLLM and OpenRouter provide logging hooks:

# LiteLLM: success/failure callbacks
import litellm

def log_success(kwargs, response, start_time, end_time):
    print(f"Model used: {kwargs['model']}")
    print(f"Tokens: {response.usage.total_tokens}")
    print(f"Cost: ${litellm.completion_cost(response):.4f}")

def log_failure(kwargs, exception, start_time, end_time):
    print(f"Failed model: {kwargs['model']}")
    print(f"Exception: {exception}")
    print(f"Fallback triggered: {kwargs.get('metadata', {}).get('fallback_model')}")

litellm.success_callback = [log_success]
litellm.failure_callback = [log_failure]

Track fallback rate over time. A spike in fallbacks is a leading indicator of provider issues before they become customer-visible. A consistently high fallback rate suggests your routing order is wrong: your primary model is unreliable for your traffic pattern, and you should reconsider the priority.
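A rolling fallback rate takes little code to compute. The sketch below is one possible shape, not a LiteLLM feature: `FallbackRateTracker`, the window size, and the 20% alert threshold are all hypothetical choices, and a real setup would feed it from the logging callbacks and export the rate to your metrics system.

```python
from collections import deque


class FallbackRateTracker:
    """Rolling share of requests served by a model other than the one requested."""

    def __init__(self, window: int = 100) -> None:
        # Only the most recent `window` requests are kept.
        self.events: deque[bool] = deque(maxlen=window)

    def record(self, requested_model: str, served_model: str) -> None:
        self.events.append(requested_model != served_model)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self, threshold: float = 0.2) -> bool:
        return self.rate > threshold


tracker = FallbackRateTracker(window=10)
for _ in range(7):
    tracker.record("claude-primary", "claude-primary")
for _ in range(3):
    tracker.record("claude-primary", "gpt-fallback")

print(f"fallback rate: {tracker.rate:.0%}, alert: {tracker.should_alert()}")
```

A short window catches spikes quickly; a longer one is better for spotting a routing order that is wrong all the time.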

The Architecture Decision

The mental model that helps: treat your LLM provider(s) as infrastructure dependencies with availability characteristics, not as magic APIs that are always up. The routing layer is the same pattern as connection pooling, circuit breakers, and database read replicas — it makes your application resilient to external dependencies that are out of your control.

The question isn’t whether you need a routing layer. It’s when the cost of building one is lower than the cost of not having one. For most production AI features serving real users, that threshold hits earlier than you’d expect.
