Rate Limiting

Handle API rate limits gracefully.

Why Rate Limiting?

API providers limit requests per minute/hour:

Provider     Limit (typical)
---------    ---------------
OpenAI       60-10,000 RPM
Anthropic    60-4,000 RPM
Google       60-1,000 RPM

Exceeding these limits returns HTTP 429 errors and, if sustained, can lead to throttling or account suspension.

Enabling Rate Limiting

In Config (Pipeline Middleware)

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 60
          burst_size: 10

probe:
  type: logic

dataset:
  format: jsonl
  path: data/prompts.jsonl

With Concurrency

insidellms run config.yaml --async --concurrency 10

The rate-limit middleware coordinates across concurrent requests.
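
Coordination works because every request draws from one shared token bucket: the bucket refills at `requests_per_minute / 60` tokens per second, up to `burst_size` tokens, and each request spends one token. A minimal sketch of the mechanism (an illustration of the idea, not the library's implementation):

```python
import time


class TokenBucket:
    """Minimal token bucket: refills `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False


# 60 RPM with burst_size 2: rate = 60/60 = 1 token/sec, capacity = 2.
bucket = TokenBucket(rate=1.0, capacity=2)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
# → True True False: the burst is spent; the third request must wait
```

Because concurrent workers all call `try_acquire` on the same bucket, raising `--concurrency` increases parallelism but not the aggregate request rate.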

Automatic Retry

When rate limited, insideLLMs retries with exponential backoff:

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 60
      - type: retry
        args:
          max_retries: 3
          initial_delay: 1.0
          max_delay: 60.0
          exponential_base: 2.0
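
With these settings, the delay before retry attempt `n` is `initial_delay * exponential_base ** n`, clamped to `max_delay`. A quick sketch of the resulting schedule (assuming no jitter is applied):

```python
def backoff_delays(max_retries: int, initial_delay: float = 1.0,
                   max_delay: float = 60.0, exponential_base: float = 2.0) -> list[float]:
    """Delay (seconds) before each retry attempt, capped at max_delay."""
    return [min(initial_delay * exponential_base ** attempt, max_delay)
            for attempt in range(max_retries)]


print(backoff_delays(3))   # → [1.0, 2.0, 4.0]
print(backoff_delays(8))   # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap matters on long runs: without `max_delay`, attempt 10 would wait over 17 minutes.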

Per-Provider Limits

Different limits for different providers:

models:
  - type: openai
    args:
      model_name: gpt-4o
    pipeline:
      middlewares:
        - type: rate_limit
          args:
            requests_per_minute: 500

  - type: anthropic
    args:
      model_name: claude-3-5-sonnet
    pipeline:
      middlewares:
        - type: rate_limit
          args:
            requests_per_minute: 60

Token-Based Limiting

For token limits (common with OpenAI):

from insideLLMs.rate_limiting import TokenBucketRateLimiter

tokens_per_minute = 90_000
token_limiter = TokenBucketRateLimiter(
    rate=tokens_per_minute / 60,
    capacity=tokens_per_minute,
)

# Acquire estimated token budget before a call
estimated_tokens = 800
token_limiter.acquire(tokens=estimated_tokens, block=True)
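
The call above needs an estimate of how many tokens the request will consume. A rough chars/4 heuristic plus the completion budget is usually close enough for budgeting (this is an approximation, not the provider's tokenizer; overestimating is safer than underestimating):

```python
def estimate_tokens(prompt: str, max_completion_tokens: int = 256) -> int:
    """Rough request-size estimate: ~4 characters per token for English text,
    plus the completion budget."""
    prompt_tokens = max(1, len(prompt) // 4)
    return prompt_tokens + max_completion_tokens


print(estimate_tokens("Summarize the following article ...", 512))  # → 520
```

For exact counts, tokenize the prompt with the provider's tokenizer before acquiring.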

Monitoring Rate Limits

Check Current State

from insideLLMs.rate_limiting import TokenBucketRateLimiter

limiter = TokenBucketRateLimiter(rate=1.0, capacity=5)
state = limiter.get_state()
print(f"Available tokens: {state.available_tokens}")
print(f"Is limited: {state.is_limited}")
print(f"Wait time (ms): {state.wait_time_ms}")

Rate Limit Headers

When providers return retry metadata, handle it via RateLimitError:

from insideLLMs.exceptions import RateLimitError

try:
    results = runner.run(prompt_set)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    if e.retry_after is not None:
        print(f"Retry after: {e.retry_after} seconds")
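
If you handle the error yourself instead of (or in addition to) the retry middleware, honoring `retry_after` is straightforward. A sketch with a stand-in exception class so it runs without the library; in practice you would catch `insideLLMs.exceptions.RateLimitError`:

```python
import time


class RateLimitError(Exception):
    """Stand-in for insideLLMs.exceptions.RateLimitError."""

    def __init__(self, msg, retry_after=None):
        super().__init__(msg)
        self.retry_after = retry_after


def run_with_retry(call, max_retries=3, fallback_delay=5.0, sleep=time.sleep):
    """Invoke `call()`, sleeping for retry_after (or a fallback) on rate limits."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError as e:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            sleep(e.retry_after if e.retry_after is not None else fallback_delay)
```

The `sleep` parameter is injectable, which keeps the loop testable without real waiting.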

Strategies

Conservative (Development)

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 30
          burst_size: 2

Run with low concurrency: insidellms run config.yaml --async --concurrency 2

Balanced (Production)

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 300
          burst_size: 20

Run with moderate concurrency: insidellms run config.yaml --async --concurrency 10

Aggressive (High Tier)

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 3000
          burst_size: 100

Run with high concurrency: insidellms run config.yaml --async --concurrency 50

Combining with Caching

Reduce rate limit pressure:

model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: cache
        args:
          cache_size: 1000
      - type: rate_limit
        args:
          requests_per_minute: 60

Cached responses don’t count against rate limits.
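
The middleware order above matters: the cache runs before the rate limiter, so a hit returns immediately and never touches the bucket. A minimal sketch of that short-circuit (a hypothetical wrapper, not the library's middleware classes):

```python
def cached_rate_limited(fn, acquire):
    """Wrap fn with a cache that short-circuits before the rate limiter."""
    cache = {}

    def wrapper(prompt):
        if prompt in cache:      # hit: no rate-limit token consumed
            return cache[prompt]
        acquire()                # miss: spend a token before calling the model
        cache[prompt] = fn(prompt)
        return cache[prompt]

    return wrapper


calls = []
model = cached_rate_limited(lambda p: p.upper(), acquire=lambda: calls.append(1))
model("hello"); model("hello"); model("world")
print(len(calls))  # → 2: only the two cache misses consumed tokens
```

For runs with many repeated prompts, this can cut rate-limit pressure dramatically.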

Error Handling

Rate limit errors that survive retries can be caught and handled explicitly:

from insideLLMs.exceptions import RateLimitError

try:
    results = runner.run(prompt_set)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    if e.retry_after is not None:
        print(f"Retry after: {e.retry_after} seconds")

Best Practices

Do

  • Start conservative, increase gradually
  • Enable caching to reduce load
  • Monitor provider dashboards
  • Use appropriate tier for workload

Don’t

  • Ignore rate limit errors
  • Set limits higher than your tier
  • Run parallel jobs without coordination

Troubleshooting

“Rate limit exceeded”

# Lower middleware rate
model:
  type: openai
  args:
    model_name: gpt-4o-mini
  pipeline:
    middlewares:
      - type: rate_limit
        args:
          requests_per_minute: 30

Run with lower concurrency: insidellms run config.yaml --async --concurrency 3

Requests still failing

Check your provider's tier limits in its dashboard. The middleware can only keep you under a limit; it cannot raise what the provider enforces, so set `requests_per_minute` at or below your actual tier.

Inconsistent limiting

Ensure all workers share a single rate limiter instance; constructing one per worker multiplies the effective rate:

from insideLLMs.pipeline import ModelPipeline, RateLimitMiddleware

rate_limit = RateLimitMiddleware(requests_per_minute=60)
pipeline = ModelPipeline(model, middlewares=[rate_limit])
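
Two independently constructed limiters each grant their full budget, so two workers with their own 60-RPM limiter produce 120 RPM in aggregate. A quick illustration with a toy budget counter (a sketch, not the library's class):

```python
class Budget:
    """Toy limiter: grants up to `capacity` permits, then refuses."""

    def __init__(self, capacity):
        self.remaining = capacity

    def try_acquire(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


def run_workers(limiters, attempts_per_worker=10):
    """Count how many requests the workers' limiters grant in total."""
    return sum(
        limiter.try_acquire()
        for limiter in limiters
        for _ in range(attempts_per_worker)
    )


shared = Budget(5)
print(run_workers([shared, shared]))        # → 5: one bucket, one budget
print(run_workers([Budget(5), Budget(5)]))  # → 10: duplicated budget
```

The same reasoning applies across processes: parallel jobs each carry their own limiter, so divide your tier's budget among them.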

See Also