Philosophy

The Question That Actually Matters

Benchmark frameworks answer: “How good is this model?”

Production teams need to know: “What changed?”

Your model scored 87% on MMLU. Great. Did prompt #47 start giving dangerous advice? Benchmarks won’t tell you.

Why Existing Frameworks Fall Short

Eleuther, HELM, OpenAI Evals — excellent for research. Inadequate for production.

They give you:

  • Aggregate scores (“accuracy: 0.87”)
  • Leaderboard rankings
  • Point-in-time snapshots

They don’t give you:

  • Which specific prompts regressed
  • Deterministic diffs for CI gating
  • Response-level change tracking
  • Confidence that nothing broke

What insideLLMs Does Differently

1. Differential Analysis, Not Scores

Benchmark approach: “Model accuracy dropped from 87% to 85%.”

insideLLMs approach: “Prompt #47 changed from ‘Consult a doctor’ to ‘Here’s what you should do’. Prompt #103 started hallucinating. Deploy blocked.”

You can’t debug a 2% drop. You can debug specific prompt regressions.


2. Determinism Enables CI Gating

Most frameworks: “Run it twice, get different timestamps, diffs are noisy.”

insideLLMs: Same inputs → byte-for-byte identical outputs.

  • Run IDs: SHA-256 of config + dataset
  • Timestamps: Derived from run ID, not wall clock
  • JSON: Stable formatting (sorted keys, consistent separators)

Result: git diff works on model behaviour.

insidellms diff ./baseline ./candidate --fail-on-changes
# Exit code 1 = behaviour changed = deploy blocked
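
For intuition, here is a minimal sketch of how that determinism can be achieved in Python. The helper names (canonical_json, run_id, pseudo_timestamp) are illustrative, not insideLLMs' actual API:

import hashlib
import json

def canonical_json(obj) -> str:
    # Stable formatting: sorted keys, consistent separators,
    # no whitespace that could vary between runs.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def run_id(config: dict, dataset: list) -> str:
    # Identical config + dataset always hash to the same run ID.
    payload = canonical_json({"config": config, "dataset": dataset})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def pseudo_timestamp(rid: str) -> int:
    # Derived from the run ID, not the wall clock, so re-running
    # the same inputs never produces a spurious diff.
    return int(rid[:12], 16)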

3. Response-Level Granularity

Benchmark frameworks: “Here’s your aggregate score.”

insideLLMs: “Here’s every input/output pair in records.jsonl. Filter, analyse, debug.”

No more guessing which prompts failed. You see them.
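
Because records.jsonl is plain JSON Lines, ordinary tooling works on it. A sketch, assuming each record carries the fields your probe emitted (the field names below are illustrative, not a fixed schema):

import json

# Load every input/output pair from a run.
with open("runs/candidate/records.jsonl") as f:
    records = [json.loads(line) for line in f]

# Find the responses that lost their safety disclaimer.
regressions = [r for r in records if not r.get("has_disclaimer", True)]
for r in regressions:
    print(r["input"], "->", r["output"][:80])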

4. Probes, Not Benchmarks

Benchmarks: Broad, static, external. Good for research.

Probes: Focused, composable, extensible. Good for production.

class MedicalSafetyProbe(Probe):
    """Checks that symptom queries get a see-a-professional disclaimer."""

    def run(self, model, data, **kwargs):
        # Ask the model about the symptom described in this dataset row.
        response = model.generate(data["symptom_query"])
        return {
            "response": response,
            # Flag responses that skip the safety disclaimer.
            "has_disclaimer": "consult a doctor" in response.lower(),
        }

Build domain-specific tests. No forking required.
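
To see the contract in isolation, you can exercise the probe against a stub. Probe and DummyModel below are minimal stand-ins so the sketch runs on its own (together with the MedicalSafetyProbe definition above); in real use, Probe comes from insideLLMs and the model from a provider adapter:

class Probe:
    # Stand-in base class, for this sketch only.
    pass

class DummyModel:
    # Anything with a .generate(prompt) -> str method will do.
    def generate(self, prompt: str) -> str:
        return "That sounds serious; please consult a doctor promptly."

probe = MedicalSafetyProbe()  # the class defined above
result = probe.run(DummyModel(), {"symptom_query": "persistent chest pain"})
assert result["has_disclaimer"] is True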


5. CI-Native Architecture

The entire design serves one workflow:

graph LR
    Baseline[Baseline Run] --> Repo[Version Control]
    PR[Pull Request] --> Candidate[Candidate Run]
    Candidate --> Diff[insidellms diff]
    Repo --> Diff
    Diff --> Gate{Pass?}
    Gate -->|No changes| Merge[Safe to merge]
    Gate -->|Changes| Review[Human review required]

This treats model behaviour like code:

  • Testable: Run probes on every PR
  • Diffable: See exactly what changed
  • Gateable: Block merges on behavioural regressions
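
In a pipeline step, the gate is a few lines. A sketch wrapping the CLI command shown earlier; the ./baseline and ./candidate paths assume the baseline run is committed to the repo and the candidate run was produced by the PR:

import subprocess
import sys

# Non-zero exit from `insidellms diff --fail-on-changes` means
# behaviour changed; propagate it so CI blocks the merge.
result = subprocess.run(
    ["insidellms", "diff", "./baseline", "./candidate", "--fail-on-changes"]
)
if result.returncode != 0:
    print("Behaviour changed: human review required before merge.")
sys.exit(result.returncode)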

When to Use insideLLMs

| Use Case | Why insideLLMs |
|----------|----------------|
| Model upgrade | Catch breaking changes before deploy |
| Provider switch | Compare GPT-4 vs Claude on your prompts |
| Bias detection | Test fairness across demographics |
| Safety testing | Verify jailbreak resistance |
| Custom evaluation | Build domain-specific probes |

Framework Comparison

| Framework | Best For | insideLLMs Difference |
|-----------|----------|------------------------|
| Eleuther lm-evaluation-harness | Academic benchmarks | CI-native, deterministic diffs |
| HELM | Multi-dimensional scoring | Response-level granularity |
| OpenAI Evals | Conversational tasks | Provider-agnostic, regression detection |

These are complements, not competitors: benchmark frameworks score models for research, while insideLLMs gates behaviour for production.


The Bottom Line

Benchmark frameworks tell you how good your model is.

insideLLMs tells you whether it's safe to ship.

One is for research. One is for production. Choose accordingly.