# Rate Limiting

Handle API rate limits gracefully.
## Why Rate Limiting?

API providers limit requests per minute (RPM) or per hour:
| Provider | Typical limit |
|---|---|
| OpenAI | 60-10,000 RPM |
| Anthropic | 60-4,000 RPM |
| | 60-1,000 RPM |
Exceeding limits causes errors and potential account issues.
## Enabling Rate Limiting

### In Config (Pipeline Middleware)

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 60
        burst_size: 10

probe:
  type: logic

dataset:
  format: jsonl
  path: data/prompts.jsonl
```
### With Concurrency

```bash
insidellms run config.yaml --async --concurrency 10
```

The rate-limit middleware coordinates across concurrent requests.
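To see how a single limiter can coordinate many workers, here is a minimal token-bucket sketch. The `SimpleTokenBucket` class is purely illustrative (it is not the library's internal implementation): all workers share one bucket guarded by a lock, so the combined request rate never exceeds the refill rate.

```python
import threading
import time


class SimpleTokenBucket:
    """Illustrative token bucket: refills `rate` tokens/sec, holds up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()  # shared by every concurrent worker

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return
                # Not enough budget: compute how long until there is
                wait = (tokens - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so others can proceed
```

Because the lock and the token count are shared, ten concurrent workers calling `acquire()` collectively stay under the configured rate rather than each getting the full budget.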
## Automatic Retry

When rate limited, insideLLMs retries with exponential backoff:

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 60
    - type: retry
      args:
        max_retries: 3
        initial_delay: 1.0
        max_delay: 60.0
        exponential_base: 2.0
```
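The retry schedule implied by these parameters can be sketched in a few lines (a hypothetical helper, using the same parameter names as the config):

```python
def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  max_delay: float = 60.0, exponential_base: float = 2.0) -> float:
    """Delay before retry `attempt` (0-indexed), capped at max_delay."""
    return min(initial_delay * exponential_base ** attempt, max_delay)


# With the config above, the three retries wait 1.0s, 2.0s, and 4.0s
delays = [backoff_delay(a) for a in range(3)]
```

The cap matters: without `max_delay`, attempt 10 would wait over 17 minutes; with it, every late retry waits at most 60 seconds.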
## Per-Provider Limits

Different providers warrant different limits:

```yaml
models:
  - type: openai
    args:
      model_name: gpt-4o
    pipeline:
      middlewares:
        - type: rate_limit
          args:
            requests_per_minute: 500
  - type: anthropic
    args:
      model_name: claude-3-5-sonnet
    pipeline:
      middlewares:
        - type: rate_limit
          args:
            requests_per_minute: 60
```
## Token-Based Limiting

For token-per-minute limits (common with OpenAI):

```python
from insideLLMs.rate_limiting import TokenBucketRateLimiter

tokens_per_minute = 90_000
token_limiter = TokenBucketRateLimiter(
    rate=tokens_per_minute / 60,  # tokens per second
    capacity=tokens_per_minute,
)

# Acquire an estimated token budget before each call
estimated_tokens = 800
token_limiter.acquire(tokens=estimated_tokens, block=True)
```
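The `estimated_tokens` value has to come from somewhere. A common rough heuristic (an assumption here, not a library feature: roughly 4 characters per token for English text) is:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a real tokenizer (e.g. tiktoken) is more accurate."""
    return max(1, round(len(text) / chars_per_token))
```

Overestimating slightly is safer than underestimating: unused budget refills, while an underestimate can push you past the provider's token limit.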
## Monitoring Rate Limits

### Check Current State

```python
from insideLLMs.rate_limiting import TokenBucketRateLimiter

limiter = TokenBucketRateLimiter(rate=1.0, capacity=5)

state = limiter.get_state()
print(f"Available tokens: {state.available_tokens}")
print(f"Is limited: {state.is_limited}")
print(f"Wait time (ms): {state.wait_time_ms}")
```
### Rate Limit Headers

When providers return retry metadata, handle it via `RateLimitError`:

```python
from insideLLMs.exceptions import RateLimitError

try:
    results = runner.run(prompt_set)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    if e.retry_after is not None:
        print(f"Retry after: {e.retry_after} seconds")
```
## Strategies

### Conservative (Development)

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 30
        burst_size: 2
```

Run with low concurrency:

```bash
insidellms run config.yaml --async --concurrency 2
```
### Balanced (Production)

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 300
        burst_size: 20
```

Run with moderate concurrency:

```bash
insidellms run config.yaml --async --concurrency 10
```
### Aggressive (High Tier)

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 3000
        burst_size: 100
```

Run with high concurrency:

```bash
insidellms run config.yaml --async --concurrency 50
```
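When picking a strategy, it helps to check that concurrency and RPM are actually matched. A simple heuristic (an illustration, assuming a known average per-request latency) is that throughput is capped by whichever is lower, the limiter or what the workers can physically sustain:

```python
def effective_rpm(configured_rpm: float, concurrency: int,
                  avg_latency_s: float) -> float:
    """Throughput is bounded by both the limiter and worker capacity."""
    worker_rpm = concurrency * 60.0 / avg_latency_s
    return min(configured_rpm, worker_rpm)


# e.g. 10 workers at ~2 s/request sustain 300 RPM, matching the balanced config;
# raising the limit to 3000 RPM without more workers changes nothing.
```

In other words, the aggressive config above only pays off if you also raise `--concurrency` enough for the workers to use the extra budget.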
## Combining with Caching

Caching reduces rate-limit pressure:

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: cache
      args:
        cache_size: 1000
    - type: rate_limit
      args:
        requests_per_minute: 60
```

Cached responses don't count against rate limits.
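The cache-before-limiter ordering is what makes this work. A minimal sketch (a hypothetical `CachedLimitedClient` class, not the library's middleware) of why cache hits never touch the limiter:

```python
import hashlib


class CachedLimitedClient:
    """Illustrative: consult the cache first; only misses consume rate budget."""

    def __init__(self, limiter, call_model):
        self.limiter = limiter        # e.g. a token-bucket rate limiter
        self.call_model = call_model  # function: prompt -> response
        self.cache = {}

    def generate(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]    # hit: no limiter interaction at all
        self.limiter.acquire()        # miss: rate limit applies to the real call
        response = self.call_model(prompt)
        self.cache[key] = response
        return response
```

With repeated prompts, only the first occurrence spends rate budget; every repeat is served from memory.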
## Error Handling

Catch and handle rate-limit errors explicitly:

```python
from insideLLMs.exceptions import RateLimitError

try:
    results = runner.run(prompt_set)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    print(f"Retry after: {e.retry_after} seconds")
```
## Best Practices

### Do

- Start conservative and increase gradually
- Enable caching to reduce load
- Monitor provider dashboards
- Use the appropriate tier for your workload

### Don't

- Ignore rate limit errors
- Set limits higher than your tier allows
- Run parallel jobs without coordinating their rate limiters
## Troubleshooting

### "Rate limit exceeded"

Lower the middleware rate:

```yaml
model:
  type: openai
  args:
    model_name: gpt-4o-mini

pipeline:
  middlewares:
    - type: rate_limit
      args:
        requests_per_minute: 30
```

Run with lower concurrency:

```bash
insidellms run config.yaml --async --concurrency 3
```
### Requests still failing

Check your provider tier limits:

- OpenAI: Rate limits
- Anthropic: Rate limits
### Inconsistent limiting

Ensure a single shared rate limiter instance:

```python
from insideLLMs.pipeline import ModelPipeline, RateLimitMiddleware

# One middleware instance shared by the pipeline, not one per call site
rate_limit = RateLimitMiddleware(requests_per_minute=60)
pipeline = ModelPipeline(model, middlewares=[rate_limit])
```
## See Also

- Caching - Reduce API calls
- Performance and Caching - More options