# insideLLMs
Stop shipping LLM regressions. Deterministic behavioural testing that catches breaking changes before they reach production.
```mermaid
graph LR
    Dataset[Dataset] --> Runner[Runner]
    Model[Models] --> Runner
    Probe[Probes] --> Runner
    Runner --> Records[records.jsonl]
    Records --> Summary[summary.json]
    Records --> Report[report.html]
    Records --> Diff[diff.json]
```
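In practice, the flow above is just two harness runs and a diff. A sketch using only the commands shown later on this page (directory names are illustrative; how the candidate run selects the updated model, e.g. via a second config, depends on your setup):

```bash
# Capture a baseline from the current model
insidellms harness config.yaml --run-dir ./baseline

# Capture a candidate run with the same probes against the updated model
insidellms harness config.yaml --run-dir ./candidate

# Compare the two runs; exits non-zero if behaviour changed
insidellms diff ./baseline ./candidate --fail-on-changes
```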
## The Problem
You update your LLM. Prompt #47 now gives dangerous medical advice. Prompt #103 starts hallucinating. Your users notice before you do.
Traditional eval frameworks can’t help. They tell you the model scored 87% on MMLU. They don’t tell you what changed.
## The Solution
insideLLMs treats model behaviour like code: testable, diffable, gateable.
```bash
insidellms diff ./baseline ./candidate --fail-on-changes
```
If behaviour changed, the deploy blocks. Simple.
## Start Here
| Goal | Path | Time |
|---|---|---|
| See it work | Quick Install → First Run | 5 min |
| Compare models | First Harness | 15 min |
| Block regressions | CI Integration | 30 min |
| Understand the approach | Philosophy | 10 min |
## Why Teams Choose insideLLMs

### Catch Regressions Before Production
Know exactly which prompts changed behaviour. No more debugging aggregate metrics.
### CI-Native Design
Built for git diff on model behaviour. Deterministic artefacts. Stable diffs. Automated gates.
### Response-Level Visibility
records.jsonl preserves every input/output pair. See what changed, not just that something changed.
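The exact schema isn't shown on this page; purely as an illustration, think of each line as a self-contained record of one probe/prompt/response interaction, along these lines (field names are hypothetical, not the actual schema):

```json
{"probe": "safety", "model": "candidate", "prompt": "...", "response": "...", "outcome": "refused"}
```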
### Provider-Agnostic
OpenAI, Anthropic, Cohere, Google, local models (Ollama, llama.cpp, vLLM). One interface.
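Provider choice is a matter of configuration rather than code. A hypothetical models block for the harness config (keys and values are illustrative, not the real schema — see the Reference docs):

```yaml
models:
  - provider: openai
    model: gpt-4o-mini
  - provider: anthropic
    model: claude-3-5-sonnet
  - provider: ollama
    model: llama3
```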
## How It Works
1. Define behavioural tests

   ```yaml
   probes:
     - type: logic    # Reasoning consistency
     - type: bias     # Fairness across demographics
     - type: safety   # Jailbreak resistance
   ```

2. Run across models

   ```bash
   insidellms harness config.yaml --run-dir ./baseline
   ```

3. Catch changes in CI

   ```bash
   insidellms diff ./baseline ./candidate --fail-on-changes
   # Exit code 1 if behaviour changed
   ```
Result: Breaking changes blocked. Users protected.
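Wiring step 3 into CI is a single job. A minimal sketch for GitHub Actions (any CI works; the package install line and the location of the stored baseline are assumptions, and provider API keys would be supplied via secrets):

```yaml
# .github/workflows/behaviour-gate.yml  (illustrative)
name: behaviour-gate
on: pull_request
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install insidellms          # package name assumed
      - run: insidellms harness config.yaml --run-dir ./candidate
      - run: insidellms diff ./baseline ./candidate --fail-on-changes
        # ./baseline is assumed to be checked in or restored from a previous run;
        # a non-zero exit here fails the job and blocks the merge/deploy.
```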
## Documentation
| Section | Description |
|---|---|
| Philosophy | Why insideLLMs exists and how it differs |
| Getting Started | Install and run your first test |
| Tutorials | Bias testing, CI integration, custom probes |
| Concepts | Models, probes, runners, determinism |
| Advanced Features | Pipeline, cost tracking, structured outputs |
| Reference | Complete CLI and API documentation |
| Guides | Caching, rate limiting, local models |
| FAQ | Common questions and troubleshooting |
## What You Get That Others Don’t
| Feature | Eleuther | HELM | OpenAI Evals | insideLLMs |
|---|---|---|---|---|
| CI diff-gating | No | No | No | Yes |
| Deterministic artefacts | No | No | No | Yes |
| Response-level granularity | No | Partial | No | Yes |
| Pipeline middleware | No | No | No | Yes |
| Cost tracking & budgets | No | No | No | Yes |
| Structured output parsing | No | No | No | Yes |
| Agent evaluation | No | No | No | Yes |
Not just benchmarks. Production infrastructure.