This page collects the knobs that affect throughput, latency, cost, and determinism. It covers the middleware pipeline, caching modules, async runners, and rate limiting utilities.
Quick map of knobs (where to set them)
- CLI: `insidellms run --async --concurrency N` (defaults in `insideLLMs/cli.py`)
- Async runners: `RunConfig.concurrency` (default 5 in `insideLLMs/config_types.py`)
- Config YAML: `RunnerConfig.concurrency` (default 1 in `insideLLMs/config.py`)
- Pipeline middleware: `CacheMiddleware`, `RateLimitMiddleware`, `RetryMiddleware` (`insideLLMs/runtime/pipeline.py`)
- Adapter settings: `AdapterConfig.timeout`, `max_retries`, `retry_delay`, `rate_limit_rpm`/`tpm` (`insideLLMs/adapters.py`)
- Advanced rate limiting: `insideLLMs/rate_limiting.py` (token buckets, retry handler, circuit breaker, concurrency limiter)
Caching options
1) Pipeline cache (in-memory, per pipeline instance)
- Class: `CacheMiddleware` (`insideLLMs/runtime/pipeline.py`)
- What it caches: `generate()`/`agenerate()` only (not chat/stream).
- Key: SHA-256 of the JSON-serialized `{prompt, **kwargs}` (`sort_keys=True`).
- Eviction: oldest entries removed when `cache_size` is exceeded (insertion order); optional `ttl_seconds` expiration (checked on access).
- Scope: per pipeline instance; not shared across processes; async uses `asyncio.Lock`.
- Gotcha: kwargs must be JSON-serializable or key generation will raise.
Example:

```python
from insideLLMs.runtime.pipeline import CacheMiddleware, ModelPipeline

pipeline = ModelPipeline(model, middlewares=[CacheMiddleware(cache_size=1000, ttl_seconds=3600)])
```
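For reference, the key derivation looks roughly like this. This is an illustrative sketch of the documented scheme (SHA-256 over the JSON-serialized prompt and kwargs), not the library's exact code:

```python
import hashlib
import json

def cache_key(prompt: str, **kwargs) -> str:
    # Sorted keys make the hash stable for the same arguments. Non-JSON-serializable
    # kwargs raise here, which is the gotcha noted above.
    payload = json.dumps({"prompt": prompt, **kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```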
2) ModelWrapper cache_responses (in-memory)
- Class: `ModelWrapper` (`insideLLMs/models/base.py`)
- What it caches: `generate()` only.
- Key: `f"{prompt}:{sorted(kwargs.items())}"` (simple string key).
- Scope: per wrapper instance; no TTL.
3) Unified caching module (general + LLM-specific)
- Module: `insideLLMs.caching` (re-exports `caching_unified.py`)
- Backends: `InMemoryCache`, `DiskCache` (SQLite persistence), `StrategyCache` (LRU/LFU/FIFO/TTL/SIZE).
- LLM helpers: `PromptCache`, `CachedModel`, cache warming + stats.
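A sketch of wiring a persistent cache to a model. The constructor arguments shown here are assumptions; check `caching_unified.py` for the real signatures:

```python
from insideLLMs.caching import CachedModel, DiskCache

# Assumed arguments: a path for the SQLite file, and a cache instance for CachedModel.
cache = DiskCache("llm_cache.sqlite")           # SQLite-backed, survives restarts
cached_model = CachedModel(model, cache=cache)  # transparently caches generate() calls
print(cached_model.generate("Summarise the quarterly report."))
```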
4) Semantic cache (exact + similarity; Redis optional)
- Module: `insideLLMs/semantic_cache.py`
- Capabilities: semantic similarity lookup with embeddings + exact match; optional Redis backend for distributed caching.
- Model wrapper: `SemanticCacheModel` caches `generate` and `chat`.
- Determinism guard: `cache_temperature_zero_only=True` by default (only temperature=0 responses are cached).
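A sketch, assuming `SemanticCacheModel` wraps an existing model instance. Arguments beyond `cache_temperature_zero_only` (embedding model, Redis connection) are omitted because their names are not confirmed here:

```python
from insideLLMs.semantic_cache import SemanticCacheModel

# cache_temperature_zero_only=True is the documented default, shown explicitly.
cached = SemanticCacheModel(model, cache_temperature_zero_only=True)
cached.generate("What is the boiling point of water?", temperature=0)    # eligible for caching
cached.generate("What is the boiling point of water?", temperature=0.7)  # not cached
```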
5) Routing caches (for model routing)
- Module: `insideLLMs/routing.py`
- Config: `RouterConfig.cache_routes` + `cache_ttl_seconds` cache route decisions; route description embeddings are cached in the matcher.
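A sketch of enabling route caching. The field names come from this page, but how the config is passed to the router is an assumption; see `insideLLMs/routing.py`:

```python
from insideLLMs.routing import RouterConfig

# Cache routing decisions for 10 minutes; route-description embeddings are
# cached inside the matcher automatically.
config = RouterConfig(cache_routes=True, cache_ttl_seconds=600)
```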
In-memory vs persistent/Redis
- In-memory: `CacheMiddleware`, `ModelWrapper`, `InMemoryCache`, `PromptCache` (fast, per process).
- Persistent (local disk): `DiskCache` (SQLite) from `caching_unified.py` survives restarts.
- Distributed (Redis): `SemanticCache` can use Redis (`redis` is an optional dependency). Useful for multi-process / multi-host caching.
Determinism and caching
- Cache keys are deterministic, but cached values reflect the first response seen for that prompt+params.
- For reproducible runs:
  - Set `temperature=0`.
  - Use `SemanticCacheModel` with `cache_temperature_zero_only=True` (the default).
  - Prefer exact-match caching over semantic matching when determinism matters.
  - Use `concurrency=1` for strict reproducibility.
  - Treat persistent caches as part of the experiment state (clear between runs if needed; see the sketch below).
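If you treat persistent caches as experiment state, clearing them between runs can be as simple as deleting the cache file. The path below is a placeholder; use whatever you passed to `DiskCache`:

```python
from pathlib import Path

# Placeholder path -- substitute the file you configured for DiskCache.
Path("llm_cache.sqlite").unlink(missing_ok=True)
```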
Async + concurrency (--async / --concurrency)
- CLI: `insidellms run --async --concurrency 5` (default 5).
- Async runner: `AsyncProbeRunner.run(..., concurrency=N)` uses an `asyncio.Semaphore` and writes results in input order (see the sketch below).
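The semaphore pattern the async runner uses looks roughly like this. It is a stdlib sketch of the idea, not the `AsyncProbeRunner` API:

```python
import asyncio

async def run_bounded(items, worker, concurrency=5):
    # At most `concurrency` workers are in flight; results are stored by index
    # so output order matches input order regardless of completion order.
    semaphore = asyncio.Semaphore(concurrency)
    results = [None] * len(items)

    async def _run(i, item):
        async with semaphore:
            results[i] = await worker(item)

    await asyncio.gather(*(_run(i, item) for i, item in enumerate(items)))
    return results
```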
Safe usage tips:
- Start low (1–5) and increase gradually; watch for 429s/timeouts.
- Combine concurrency with rate limiting to avoid bursts.
- If the base model is sync-only, async paths run in a thread pool; very high concurrency can exhaust threads.
Rate limiting, retries, timeouts
Pipeline middleware
- `RateLimitMiddleware`: token bucket in requests per minute (default 60). Sleeps when limited.
- `RetryMiddleware`: backoff + jitter; retries common transient errors (default `max_retries=3`).
Ordering matters (middleware list is request order):
- Put `CacheMiddleware` early so hits bypass rate limiting and retries.
- If you want retries to re-enter the rate limiter, place `RetryMiddleware` before `RateLimitMiddleware`.
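For example, a pipeline where cache hits skip everything and each retry re-enters the rate limiter. The argument names follow this page; check `insideLLMs/runtime/pipeline.py` for the full signatures:

```python
from insideLLMs.runtime.pipeline import (
    CacheMiddleware,
    ModelPipeline,
    RateLimitMiddleware,
    RetryMiddleware,
)

pipeline = ModelPipeline(
    model,
    middlewares=[
        CacheMiddleware(cache_size=1000, ttl_seconds=3600),  # cache hits return here
        RetryMiddleware(max_retries=3),                      # retries re-enter the limiter below
        RateLimitMiddleware(requests_per_minute=30),         # token bucket; sleeps when limited
    ],
)
```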
Adapter-level timeouts / retry fields
`AdapterConfig.timeout` is passed to provider clients (e.g., OpenAI/Anthropic). `max_retries`, `retry_delay`, and `rate_limit_rpm`/`tpm` are configuration fields only today (not enforced in adapter code); use middleware or `insideLLMs/rate_limiting.py` for actual behaviour.
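A sketch, assuming the fields are set via the constructor; treat everything except `timeout` as documentation-only for now:

```python
from insideLLMs.adapters import AdapterConfig

config = AdapterConfig(
    timeout=30,          # passed through to the provider client
    max_retries=3,       # configuration-only today -- not enforced by the adapter
    retry_delay=1.0,     # configuration-only today
    rate_limit_rpm=60,   # configuration-only today; use middleware for real limiting
)
```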
Provider rate limits (and starting values)
insideLLMs does not hard-code provider limits; use your provider’s published RPM/TPM quotas.
Practical starting points (conservative):
- `--concurrency`: 1–5
- `RateLimitMiddleware(requests_per_minute=30–60, burst_size=...)`
- Increase slowly while monitoring 429s/timeouts and latency.
If your provider publishes TPM limits, add a token-based limiter and/or reduce `max_tokens` per request (see the sketch below).
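A minimal token-bucket sketch for TPM budgets. This is illustrative only; the real implementations live in `insideLLMs/rate_limiting.py` and their class names are not shown on this page:

```python
import time

class TokenBudget:
    """Rough TPM limiter: refills continuously, blocks until enough tokens are free."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.last_refill = time.monotonic()

    def acquire(self, estimated_tokens: int) -> None:
        # Requests larger than the whole budget should be split by the caller.
        while True:
            now = time.monotonic()
            self.available = min(
                self.capacity,
                self.available + (now - self.last_refill) * self.capacity / 60.0,
            )
            self.last_refill = now
            if self.available >= estimated_tokens:
                self.available -= estimated_tokens
                return
            time.sleep(0.1)
```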
Cost control tips
- Turn on caching for repeated prompts (`CacheMiddleware`, `CachedModel`, `SemanticCacheModel`).
- Keep `temperature=0` where possible to maximise cache hits and determinism.
- Lower `max_tokens` and add stop sequences to cap output length (see the sketch below).
- Use smaller models for development or smoke tests.
- Use `--async` to reduce wall-clock time, but keep concurrency within rate limits to avoid retries.
- Use `resume` for long runs so you don’t re-pay for completed items.
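For example, capping output length per request. Parameter names such as `max_tokens` and `stop` vary by provider, so treat them as placeholders:

```python
# Shorter outputs cost less and cache better; exact kwargs depend on the provider.
response = model.generate(
    "List three risks of the proposal.",
    temperature=0,
    max_tokens=256,
    stop=["\n\n"],
)
```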