# insideLLMs

Stop shipping LLM regressions. Deterministic behavioural testing that catches breaking changes before they reach production.

```mermaid
graph LR
    Dataset[Dataset] --> Runner[Runner]
    Model[Models] --> Runner
    Probe[Probes] --> Runner
    Runner --> Records[records.jsonl]
    Records --> Summary[summary.json]
    Records --> Report[report.html]
    Records --> Diff[diff.json]
```

## The Problem

You update your LLM. Prompt #47 now gives dangerous medical advice. Prompt #103 starts hallucinating. Your users notice before you do.

Traditional eval frameworks can’t help. They tell you the model scored 87% on MMLU. They don’t tell you what changed.

## The Solution

insideLLMs treats model behaviour like code: testable, diffable, gateable.

```bash
insidellms diff ./baseline ./candidate --fail-on-changes
```

If behaviour changed, the deploy blocks. Simple.
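
A minimal sketch of what that gate can look like in CI. Only the `insidellms` commands are taken from this page; the workflow shape, install step, and the choice to keep the baseline in the repo are illustrative assumptions:

```yaml
# Hypothetical GitHub Actions job: fail the build if candidate behaviour
# drifts from the baseline. Install step and baseline location are assumptions.
name: behaviour-gate
on: [pull_request]

jobs:
  diff-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install insidellms                                # assumed install method
      - run: insidellms harness config.yaml --run-dir ./candidate  # produce candidate artefacts
      - run: insidellms diff ./baseline ./candidate --fail-on-changes
        # ./baseline is assumed to be committed or restored from storage beforehand.
        # A non-zero exit code fails the job and blocks the deploy.
```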


## Start Here

| Goal | Path | Time |
|------|------|------|
| See it work | Quick Install → First Run | 5 min |
| Compare models | First Harness | 15 min |
| Block regressions | CI Integration | 30 min |
| Understand the approach | Philosophy | 10 min |

## Why Teams Choose insideLLMs

### Catch Regressions Before Production

Know exactly which prompts changed behaviour. No more debugging aggregate metrics.

### CI-Native Design

Built for git diff on model behaviour. Deterministic artefacts. Stable diffs. Automated gates.

### Response-Level Visibility

`records.jsonl` preserves every input/output pair. See what changed, not just that something changed.
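
For a sense of what that looks like, one record might resemble the line below. The field names are illustrative assumptions, not the documented schema; the point is that each prompt/response pair is its own diffable line:

```json
{"probe": "safety", "model": "gpt-4o", "prompt_id": 47, "input": "Can I double my medication dose?", "output": "No. Do not change your dose without talking to your doctor.", "pass": true}
```

Because each record stands alone, a diff can point at the exact prompt whose behaviour changed rather than a shifted aggregate score.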

### Provider-Agnostic

OpenAI, Anthropic, Cohere, Google, local models (Ollama, llama.cpp, vLLM). One interface.
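
As an illustration of that single interface, a harness config might list models from several providers side by side. The `models:` block below is an assumed shape, not the documented schema (this page only shows the `probes:` section):

```yaml
# Illustrative only: the key names in this models block are assumptions.
models:
  - provider: openai
    name: gpt-4o
  - provider: anthropic
    name: claude-3-5-sonnet
  - provider: ollama        # local model served by Ollama
    name: llama3
```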


## How It Works

### 1. Define behavioural tests

```yaml
probes:
  - type: logic      # Reasoning consistency
  - type: bias       # Fairness across demographics
  - type: safety     # Jailbreak resistance
```

### 2. Run across models

```bash
insidellms harness config.yaml --run-dir ./baseline
```
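
Each run writes the artefacts shown in the diagram above into its run directory. The exact layout here is illustrative:

```bash
ls ./baseline
# records.jsonl   # every input/output pair from the run
# summary.json    # aggregate results
# report.html     # human-readable report
```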

### 3. Catch changes in CI

```bash
insidellms diff ./baseline ./candidate --fail-on-changes
# Exit code 1 if behaviour changed
```

Result: Breaking changes blocked. Users protected.


## Documentation

| Section | Description |
|---------|-------------|
| Philosophy | Why insideLLMs exists and how it differs |
| Getting Started | Install and run your first test |
| Tutorials | Bias testing, CI integration, custom probes |
| Concepts | Models, probes, runners, determinism |
| Advanced Features | Pipeline, cost tracking, structured outputs |
| Reference | Complete CLI and API documentation |
| Guides | Caching, rate limiting, local models |
| FAQ | Common questions and troubleshooting |

## What You Get That Others Don’t

| Feature | Eleuther | HELM | OpenAI Evals | insideLLMs |
|---------|----------|------|--------------|------------|
| CI diff-gating | No | No | No | Yes |
| Deterministic artefacts | No | No | No | Yes |
| Response-level granularity | No | Partial | No | Yes |
| Pipeline middleware | No | No | No | Yes |
| Cost tracking & budgets | No | No | No | Yes |
| Structured output parsing | No | No | No | Yes |
| Agent evaluation | No | No | No | Yes |

Not just benchmarks. Production infrastructure.


## Community