In the rapidly evolving landscape of Large Language Model (LLM) applications, developers often face significant challenges when integrating multiple LLM providers. Each provider—whether OpenAI, Anthropic, Cohere, or others—comes with its own API structure, authentication methods, and response formats, creating a complex integration landscape.
What is LiteLLM?
LiteLLM is an open-source library that provides a unified interface for working with various LLM providers. It acts as an abstraction layer that standardizes interactions with different LLMs, allowing developers to write consistent code regardless of which model they’re using.
Key Advantages Over Native APIs
- Unified Interface: Write once, deploy anywhere—your code works the same way across OpenAI, Anthropic, Azure, Cohere, and 100+ other LLMs.
- Simplified Provider Switching: Change models with a single line of code instead of rewriting entire integration layers.
- Cost Optimization: Easily switch between models based on performance needs and pricing considerations without code refactoring.
- Enhanced Reliability: Built-in retry and fallback mechanisms provide resilience against API outages or rate limits.
- Enterprise-Ready Features: Access logging, monitoring, and budget management capabilities not available in native APIs.
- Vendor Independence: Avoid vendor lock-in by designing your application to be model-agnostic from the start.
This article explores some of LiteLLM’s most powerful features that help developers build more reliable and cost-effective LLM applications.
Basic Usage
Installation
pip install litellm
Environment Setup
We need to set up environment variables for the Anthropic and OpenAI API keys:
export ANTHROPIC_API_KEY=your-key-here
export OPENAI_API_KEY=your-key-here
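If you are working in a notebook or cannot export shell variables, you can also set the keys programmatically before calling LiteLLM. A minimal sketch (the values below are placeholders, not real keys):

import os

# Set provider keys programmatically (placeholders shown; use your real keys)
os.environ["OPENAI_API_KEY"] = "your-key-here"
os.environ["ANTHROPIC_API_KEY"] = "your-key-here"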
Getting started with LiteLLM is straightforward. After installation, you can immediately begin using multiple LLM providers through a consistent interface:
import litellm
from litellm import completion

# Basic completion with OpenAI
openai_response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(openai_response)

# The same code structure works with Anthropic
anthropic_response = completion(
    model="anthropic/claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(anthropic_response.choices[0].message.content)

# Switch between models with a single parameter change
response = completion(
    model="gpt-4",  # Just change the model name to switch models or providers
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response)
This unified API allows you to interact with any supported LLM using the same code pattern, dramatically simplifying development and maintenance.
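Because the call signature is identical across providers, you can even iterate over a list of models and compare their answers with the same few lines of code. A minimal sketch (the model list here is just an illustration):

from litellm import completion

models = ["gpt-3.5-turbo", "anthropic/claude-3-5-sonnet-latest"]  # any supported models

for model in models:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize the theory of relativity in one sentence."}]
    )
    # The response object follows the OpenAI format for every provider
    print(model, "->", response.choices[0].message.content)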
Retry and Fallback Mechanisms
LiteLLM provides robust retry and fallback mechanisms to enhance the reliability of LLM integrations. These features help ensure successful completions even in the face of temporary failures or rate-limiting issues.
Retries
LiteLLM allows you to specify the number of retries for a request in case of failure. Here’s an example:
from litellm import completion

response = completion(
    model="anthropic/claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    num_retries=2  # Retry up to 2 times if the initial request fails
)

In this example, the completion function is called with num_retries set to 2, meaning that if the first attempt fails, LiteLLM will automatically retry the request up to two additional times.
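If every retry is exhausted, completion raises an exception, so it is still worth wrapping calls in a try/except when you need graceful degradation. A minimal sketch (the fallback behavior here is just an illustration):

from litellm import completion

try:
    response = completion(
        model="anthropic/claude-3-5-sonnet-latest",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        num_retries=2
    )
    print(response.choices[0].message.content)
except Exception as exc:
    # All retries failed; degrade gracefully instead of crashing
    print(f"LLM call failed after retries: {exc}")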
Fallbacks
In addition to retries, LiteLLM provides powerful fallback mechanisms that enable seamless recovery from failures without disrupting your application. This includes the ability to fall back to different models, providers, or model families when necessary.
Model Fallbacks
LiteLLM supports fallback to different models, which is useful when specific models fail or hit rate limits:
from litellm import completion

response = completion(
    model="anthropic/claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    fallbacks=["anthropic/claude-3-opus-latest"]  # Fall back to a different Anthropic model if needed
)
print(response)
Cross-Provider Fallbacks
One of LiteLLM’s most powerful features is the ability to fall back across different providers, ensuring high availability even if an entire provider experiences an outage:
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    fallbacks=["anthropic/claude-3-5-sonnet-latest", "gpt-4o-mini"]  # Cross-provider fallbacks
)
print(response)
In this example, if the initial request to OpenAI's GPT-4o fails, LiteLLM will seamlessly try Anthropic's Claude 3.5 Sonnet, and if that also fails, it will fall back to OpenAI's GPT-4o mini.
Dynamic Model Selection
For more advanced scenarios, you can create a router with multiple models and dynamic routing:
import litellm
from litellm.router import Router
import os

# Define a router with multiple models and routing rules
router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",  # This is the model name we'll reference
            "litellm_params": {
                "model": "gpt-3.5-turbo",  # This is the actual model identifier
                "api_key": os.environ.get("OPENAI_API_KEY")
            },
            "tpm": 100000,  # Tokens per minute limit
            "rpm": 1000     # Requests per minute limit
        },
        {
            "model_name": "claude-3-5-sonnet",  # Model name for routing
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-latest",  # Actual model identifier
                "api_key": os.environ.get("ANTHROPIC_API_KEY")
            },
            "tpm": 80000,
            "rpm": 900
        }
    ],
    routing_strategy="simple-shuffle"  # Randomly distribute requests across deployments
)

# Route to the best available model automatically
# Important: use the model_name from our configuration, not the provider's full model name
response = router.completion(
    model="claude-3-5-sonnet",  # Must match the model_name in our config
    messages=[{"role": "user", "content": "Explain AI to me"}]
)
print(response)

# We can also use fallbacks directly with the router
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    fallbacks=["claude-3-5-sonnet"]  # Use model_name values as fallbacks
)
print(response)
This configuration automatically routes requests based on model availability, rate limits, and other factors, providing robust fallback capabilities for production applications.
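The router also exposes an async interface, which is useful when serving many concurrent requests. A minimal sketch, assuming the router defined above and an async runtime:

import asyncio

async def ask(question: str):
    # acompletion is the async counterpart of router.completion
    response = await router.acompletion(
        model="gpt-3.5-turbo",  # model_name from the router config
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

async def main():
    answers = await asyncio.gather(
        ask("Explain AI to me"),
        ask("Explain quantum computing"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())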
Proxy Capabilities
LiteLLM includes a proxy server (LLM Gateway) that provides additional features and capabilities for LLM integration. The proxy server acts as an intermediary between your application and the LLM providers, enabling advanced functionality like rate limiting, caching, and more.
Installing the Proxy Addon
pip install litellm[proxy]
Setting Up the Proxy Server
To start the LiteLLM proxy server, you can use the command-line interface:
# Start the proxy server on the default port
litellm --model gpt-4o
# Or specify a custom port
litellm --model gpt-4o --port 8081
You can also configure the proxy server through a configuration file:
litellm --config /path/to/config.yaml
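A typical configuration file declares the models the proxy should expose, following the same model_list idea used with the Router above; the sketch below pulls API keys from environment variables (treat the exact entries shown as an illustration):

# config.yaml -- a minimal proxy configuration sketch
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY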
Using the Proxy in Your Application
Once the proxy server is running, you can use it from your application; just point base_url at the host and port where the proxy is listening:
import openai

client = openai.OpenAI(
    api_key="anything",  # placeholder; the proxy holds the real provider keys
    base_url="http://127.0.0.1:8081",
)

# Request is sent to the model configured on the litellm proxy (`litellm --model`)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "What's the capital of the US state of Georgia?"
        }
    ]
)
print(response)
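Because the proxy speaks the OpenAI wire format, any OpenAI-compatible client works. For quick checks you can also hit the chat completions endpoint directly, for example with curl (adjust the port to wherever your proxy is running):

curl http://127.0.0.1:8081/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of the US state of Georgia?"}]
  }'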
Rate Limiting
The LiteLLM proxy server supports rate limiting to prevent abuse and ensure fair usage:
# In your proxy configuration file (config.yaml)
rate_limits:
  - api_key: "sk-my-key-1"
    rpm: 10        # 10 requests per minute
  - api_key: "sk-my-key-2"
    tpm: 10000     # 10,000 tokens per minute
  - model: "gpt-4"
    rpm: 5         # 5 requests per minute for this model specifically
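Clients then authenticate against the proxy with one of these keys, and requests beyond the configured limits are rejected by the proxy rather than by the upstream provider. A minimal sketch, assuming the proxy from the earlier examples is listening on port 8081:

import openai

# Authenticate with one of the keys defined in the proxy's rate-limit config
client = openai.OpenAI(
    api_key="sk-my-key-1",
    base_url="http://127.0.0.1:8081",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me one fun fact about octopuses."}]
)
print(response.choices[0].message.content)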
Request Routing and Load Balancing
The proxy can route requests across multiple deployments for load balancing:
# In your proxy configuration file (config.yaml)
router_settings:
  routing_strategy: "least-busy"  # Options: "least-busy", "simple-shuffle", "usage-based"
  model_group:
    - name: "gpt-4-group"
      models: ["gpt-4", "anthropic/claude-3-opus-latest"]
Now in your application:
import litellm

response = litellm.completion(
    model="gpt-4-group",  # Use the model group name
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Caching
Enable caching to improve performance and reduce costs:
# In your proxy configuration file (config.yaml)
cache_settings:
  cache_type: "redis"  # Options: "redis", "in-memory"
  redis_host: "localhost"
  redis_port: 6379
  redis_password: ""
  cache_time: 3600  # Cache expiration time in seconds
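With caching enabled, repeated identical requests are served from the cache instead of triggering a new provider call. The sketch below sends the same prompt twice through the proxy and times both calls; the second should return noticeably faster when it is a cache hit (port and model follow the earlier examples):

import time
import openai

client = openai.OpenAI(api_key="anything", base_url="http://127.0.0.1:8081")

def timed_request():
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}]
    )
    print(f"{time.time() - start:.2f}s -> {response.choices[0].message.content[:60]}...")

timed_request()  # First call goes to the provider
timed_request()  # Identical call should be answered from the cache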
Monitoring and Logging
The proxy provides extensive logging capabilities:
# In your proxy configuration file (config.yaml)
logging:
  level: "info"  # Options: "debug", "info", "warning", "error"
  log_file: "/path/to/litellm.log"
telemetry:
  provider: "prometheus"  # Options: "prometheus", "cloudwatch"
  metrics_port: 9090
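With the Prometheus option enabled, you can point a Prometheus scraper (or a quick curl) at the configured metrics port; the command below assumes the exporter serves the conventional /metrics endpoint on the port from the config above:

# Inspect the exported metrics locally (assumes the /metrics endpoint and port 9090 from the config)
curl http://localhost:9090/metrics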
With these proxy capabilities, you can build enterprise-grade LLM applications with enhanced reliability, performance, and cost control. The proxy server makes it easy to implement advanced features without adding complexity to your application code.
Budget Management with Proxy
For production deployments, you can use LiteLLM’s proxy server with a configuration file that includes budget settings:
# In your proxy configuration file (config.yaml)
general_settings:
  # Set a default budget per key
  default_key_generate_params:
    max_budget: 50.0            # $50 USD default budget
    budget_duration: "monthly"  # Reset monthly

router_settings:
  # Track costs for each model
  track_cost_per_model: true

virtual_keys:
  - key_alias: "team-research"
    models: ["gpt-4", "anthropic/claude-3-5-sonnet-latest"]
    max_budget: 100.0  # $100 budget for research team
    budget_duration: "monthly"
  - key_alias: "team-support"
    models: ["gpt-3.5-turbo", "mistral/mistral-small"]
    max_budget: 50.0   # $50 budget for support team
    budget_duration: "monthly"
Start the proxy with:
litellm --config /path/to/config.yaml
LiteLLM’s budget management features provide granular control over API spending, making it easier to manage costs in production environments where unexpected usage spikes could lead to significant expenses.
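On the SDK side, LiteLLM also exposes a completion_cost helper that estimates the dollar cost of an individual response, which is handy for lightweight cost tracking even without the proxy:

from litellm import completion, completion_cost

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about autumn."}]
)

# Estimate the cost of this single call in USD based on the model's token pricing
cost = completion_cost(completion_response=response)
print(f"Estimated cost: ${cost:.6f}")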
Conclusion
LiteLLM addresses a critical need in the LLM development ecosystem by providing a unified, robust interface to multiple LLM providers. By abstracting away the differences between various APIs, it enables developers to focus on building applications rather than managing integration complexities. This unified approach enhances reliability through built-in retry mechanisms and cross-provider fallbacks that ensure high availability even when specific models or providers experience outages.
Cost control becomes much more manageable with LiteLLM’s comprehensive budget management features that prevent unexpected spending through customizable limits and alerts. Performance is optimized through intelligent caching and routing capabilities that improve response times while reducing API costs. Perhaps most importantly, LiteLLM provides true vendor flexibility, allowing developers to switch between models or providers with minimal code changes.