# Rate Limiting
Handle API rate limits gracefully.
## Why Rate Limiting?

API providers limit requests per minute/hour:

| Provider | Limit (typical) |
|---|---|
| OpenAI | 60-10,000 RPM |
| Anthropic | 60-4,000 RPM |
|  | 60-1,000 RPM |
Exceeding limits causes errors and potential account issues.
## Enabling Rate Limiting

### In Config

```yaml
rate_limit:
  enabled: true
  requests_per_minute: 60
  requests_per_second: 1
```
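Conceptually, a limiter like this spaces calls out so they never exceed the configured rate. A minimal, library-agnostic sketch of the pacing implied by `requests_per_minute: 60` (not insideLLMs internals):

```python
import time

REQUESTS_PER_MINUTE = 60
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE  # one second between calls

_last_call = 0.0

def paced(fn, *args, **kwargs):
    """Block just long enough to keep calls at or below the configured rate."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return fn(*args, **kwargs)

# Two back-to-back calls end up at least MIN_INTERVAL apart.
paced(print, "first request")
paced(print, "second request")
```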
### With Concurrency

```yaml
async: true
concurrency: 10
rate_limit:
  enabled: true
  requests_per_minute: 300
```
The rate limiter coordinates across concurrent requests, so the combined throughput of all workers stays within the configured limit.
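As intuition for that coordination, here is a minimal sketch in which every async worker waits on one shared limiter that hands out send slots (illustrative only, not insideLLMs internals):

```python
import asyncio
import time

class SharedLimiter:
    """A single limiter shared by all workers: hands out send slots in order."""

    def __init__(self, requests_per_minute: int):
        self.interval = 60.0 / requests_per_minute
        self.next_slot = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

async def worker(limiter: SharedLimiter, i: int) -> None:
    await limiter.acquire()  # every worker shares the same pacing
    print(f"request {i} sent")

async def main() -> None:
    limiter = SharedLimiter(requests_per_minute=300)  # one slot per 0.2s
    await asyncio.gather(*(worker(limiter, i) for i in range(10)))

asyncio.run(main())
```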
## Automatic Retry

When rate limited, insideLLMs retries with exponential backoff:

```yaml
rate_limit:
  enabled: true
  requests_per_minute: 60
  retry:
    max_attempts: 3
    initial_delay: 1.0
    max_delay: 60.0
    exponential_base: 2.0
```
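Assuming the conventional reading of these fields, the wait before retry *n* is `initial_delay * exponential_base**(n-1)`, capped at `max_delay`:

```python
initial_delay, max_delay, exponential_base = 1.0, 60.0, 2.0

for attempt in range(1, 4):  # max_attempts: 3
    delay = min(initial_delay * exponential_base ** (attempt - 1), max_delay)
    print(f"retry {attempt} waits {delay:.0f}s")  # 1s, 2s, 4s
```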
## Per-Provider Limits

Different limits for different providers:

```yaml
models:
  - type: openai
    args:
      model_name: gpt-4o
    rate_limit:
      requests_per_minute: 500
  - type: anthropic
    args:
      model_name: claude-3-5-sonnet
    rate_limit:
      requests_per_minute: 60
```
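One way to picture this config is a separate limiter per provider. A hypothetical sketch reusing the `SharedLimiter` class from the concurrency example above (illustrative wiring, not the library's internals):

```python
# Hypothetical: key one SharedLimiter (from the sketch above) per provider.
limits = {"openai": 500, "anthropic": 60}  # requests per minute
limiters = {name: SharedLimiter(rpm) for name, rpm in limits.items()}

async def call_provider(provider: str, prompt: str) -> str:
    await limiters[provider].acquire()  # each provider is paced independently
    return f"(response from {provider})"  # stand-in for the real client call
```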
## Token-Based Limiting

For token limits (common with OpenAI):

```yaml
rate_limit:
  enabled: true
  tokens_per_minute: 90000
```
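Request counting alone misses this, since one request may consume thousands of tokens. A rough, illustrative sketch of per-minute token accounting (not the library's implementation):

```python
import time

class TokenBudget:
    """Track tokens consumed in a sliding one-minute window."""

    def __init__(self, tokens_per_minute: int):
        self.budget = tokens_per_minute
        self.spent = []  # (timestamp, tokens) pairs

    def wait_for(self, tokens: int) -> None:
        while True:
            cutoff = time.monotonic() - 60.0
            self.spent = [(t, n) for t, n in self.spent if t > cutoff]
            if sum(n for _, n in self.spent) + tokens <= self.budget:
                self.spent.append((time.monotonic(), tokens))
                return
            time.sleep(0.5)  # back off until the window frees up

budget = TokenBudget(tokens_per_minute=90_000)
budget.wait_for(1_200)  # reserve ~1,200 tokens before sending a request
```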
## Monitoring Rate Limits

### Check Current State

```python
from insideLLMs.rate_limiting import get_rate_limiter

limiter = get_rate_limiter()
print(f"Requests remaining: {limiter.remaining}")
print(f"Reset in: {limiter.reset_in} seconds")
```
### Rate Limit Headers

insideLLMs reads provider headers automatically:

```text
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1234567890
```
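For reference, here is how such headers can be interpreted in plain Python (using the sample values above; actual header names vary by provider):

```python
# Illustrative only: a dict stands in for an HTTP response's headers.
headers = {"X-RateLimit-Remaining": "45", "X-RateLimit-Reset": "1234567890"}

remaining = int(headers.get("X-RateLimit-Remaining", "0"))
reset_at = int(headers.get("X-RateLimit-Reset", "0"))  # Unix timestamp

if remaining == 0:
    print(f"Out of quota; window resets at {reset_at}")
else:
    print(f"{remaining} requests left in this window")
```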
## Strategies

### Conservative (Development)

```yaml
async: true
concurrency: 2
rate_limit:
  requests_per_minute: 30
```

### Balanced (Production)

```yaml
async: true
concurrency: 10
rate_limit:
  requests_per_minute: 300
```

### Aggressive (High Tier)

```yaml
async: true
concurrency: 50
rate_limit:
  requests_per_minute: 3000
```
## Combining with Caching

Reduce rate limit pressure:

```yaml
cache:
  enabled: true
  backend: sqlite
rate_limit:
  enabled: true
  requests_per_minute: 60
```
Cached responses don’t count against rate limits.
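The effect is that only cache misses spend rate-limit budget. A self-contained sketch of the pattern (the one-second sleep stands in for waiting on the limiter):

```python
import time

cache: dict[str, str] = {}

def get_response(prompt: str, model_call) -> str:
    """Only cache misses hit the API, so only misses spend rate-limit budget."""
    if prompt in cache:
        return cache[prompt]        # served locally: no request is sent
    time.sleep(1.0)                 # stand-in for waiting on the rate limiter
    cache[prompt] = model_call(prompt)
    return cache[prompt]

# The second call is free: it never reaches the (simulated) API.
print(get_response("2+2?", lambda p: "4"))
print(get_response("2+2?", lambda p: "4"))
```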
## Error Handling

Rate limit errors are caught and handled:

```python
from insideLLMs.exceptions import RateLimitError

try:
    results = runner.run(prompt_set)  # runner and prompt_set built earlier
except RateLimitError as e:
    print(f"Rate limited: {e}")
    print(f"Retry after: {e.retry_after} seconds")
```
## Best Practices

### Do
- Start conservative, increase gradually
- Enable caching to reduce load
- Monitor provider dashboards
- Use appropriate tier for workload
### Don’t
- Ignore rate limit errors
- Set limits higher than your tier
- Run parallel jobs without coordination
## Troubleshooting

### “Rate limit exceeded”

```yaml
# Lower concurrency
concurrency: 3

# Lower rate
rate_limit:
  requests_per_minute: 30
```
### Requests still failing

Check your provider tier limits:

- OpenAI: [Rate limits](https://platform.openai.com/docs/guides/rate-limits)
- Anthropic: [Rate limits](https://docs.anthropic.com/en/api/rate-limits)
### Inconsistent limiting

Ensure a single rate limiter instance is shared:

```python
# Use the singleton accessor
from insideLLMs.rate_limiting import get_rate_limiter

limiter = get_rate_limiter()  # Same instance everywhere
```
## See Also
- Caching - Reduce API calls
- Performance and Caching - More options