No vibes. Only evidence.

Model behaviour that drifts silently is an operational risk.

The adoption and governance layer for the insideLLMs open-source toolchain: deterministic probes, response-level diffs, and CI gating for production confidence.

Probe-driven testing + deterministic artefacts + policy gates = controllable AI rollout.


Quick start

Use the same command flow promoted in the upstream project docs, then layer governance review on top.

Step 1

Run a fast confidence check

Start with insidellms quicktest "your prompt" to spot obvious failure modes quickly.

Step 2

Execute deterministic harness probes

Run insidellms harness probes/financial.toml --output out/ to produce reproducible run artefacts.

Step 3

Diff and gate in CI

Use insidellms diff baseline.json candidate.json --fail-on-changes to block unsafe drift before merge.

How this site complements the docs

GitHub docs site

Canonical product documentation, command references, and implementation details for engineers integrating the library and CLI.

Read official docs

This website

Positioning, rollout guidance, governance framing, and buyer-facing evidence language for platform, risk, and compliance stakeholders.

Open implementation map

Evidence chain

The chain is the product: if the chain is incomplete, the claim is unactionable.

Capture

Run manifest

Every run stores inputs · environment · model config · tools · outputs as a single artefact.

Manifest: run-id r_8f3a1d2 · hash 4b2af4c · signer kms:key-42

Mandatory trust anchor: if two teams run the same manifest and seed, outputs are comparable and reproducible.
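
As an illustration of how a manifest hash can stay comparable across teams, the sketch below hashes a canonical JSON serialisation. The field names and the manifest_hash helper are assumptions made for this example, not the insideLLMs manifest schema.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Return a SHA-256 hex digest of a canonically serialised manifest."""
    # Sorted keys and fixed separators make the bytes independent of
    # dict insertion order and whitespace choices.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifests: same content, different key order.
team_a = {"seed": 42, "model": {"name": "example-model", "temperature": 0}, "tools": ["retrieval"]}
team_b = {"tools": ["retrieval"], "model": {"temperature": 0, "name": "example-model"}, "seed": 42}

same = manifest_hash(team_a) == manifest_hash(team_b)
```

Here same is True because both dictionaries serialise to identical bytes; changing the seed or any other field yields a different digest.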

Replay

Deterministic replay

Replay is the primitive. It is explicit, versioned, and recorded. Replay results are evaluated like any other engineering build artefact.

Environment lock-step
Tool-call snapshots
Stable output encoding
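
Tool-call snapshots can be pictured as a lookup table that replaces live tool calls during replay. The following is a minimal sketch under that assumption; SnapshotTools and its key format are hypothetical and are not part of the insideLLMs API.

```python
import json


class SnapshotTools:
    """Serve tool results from a recorded snapshot instead of live calls."""

    def __init__(self, snapshot: dict):
        self._snapshot = snapshot
        self.calls = []  # ordered log of every call made during replay

    @staticmethod
    def key(tool: str, args: dict) -> str:
        # Canonical JSON keeps the key stable regardless of argument order.
        return tool + ":" + json.dumps(args, sort_keys=True, separators=(",", ":"))

    def call(self, tool: str, args: dict):
        k = self.key(tool, args)
        self.calls.append(k)
        if k not in self._snapshot:
            # An unrecorded call means the replay has diverged from the baseline.
            raise KeyError("unrecorded tool call during replay: " + k)
        return self._snapshot[k]


# Hypothetical snapshot recorded during the baseline run.
snapshot = {SnapshotTools.key("crm_lookup", {"id": 7}): {"tier": "gold"}}
tools = SnapshotTools(snapshot)
result = tools.call("crm_lookup", {"id": 7})
```

Raising on an unrecorded call is a deliberate choice in this sketch: a replay that reaches for a tool the baseline never used is itself a behavioural change worth surfacing.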

Diff

Behavioural diff as first-class data

Baseline vs candidate outputs become the visible boundary between intended and introduced behaviour. This is the gate input.

Token-level deltas
Semantic change score
Risk tags
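
Token-level deltas can be approximated with Python's standard difflib module. This sketch splits on whitespace, which is an assumption; a real tokeniser would differ, and this lexical comparison is not the semantic change score mentioned above.

```python
import difflib


def token_deltas(baseline: str, candidate: str) -> list:
    """Return (op, baseline_tokens, candidate_tokens) for each changed span."""
    a, b = baseline.split(), candidate.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only insert, delete, and replace spans
    ]


# Hypothetical baseline and candidate responses.
deltas = token_deltas(
    "Refund approved after manager review",
    "Refund approved without manager review",
)
# deltas == [("replace", ["after"], ["without"])]
```

A one-token delta like this is small lexically but reverses the meaning of the response, which is why the raw delta is best treated as gate input rather than as a verdict.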

Gate

CI enforcement

The gate consumes diff artefacts and policy rules. Drift that violates policy fails the build before it reaches users.

insidellms diff baseline.json candidate.json --fail-on-changes
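
To illustrate how a gate might combine diff artefacts with policy rules, here is a minimal sketch. DiffItem, Policy, and gate are hypothetical names for this example and do not describe how the insidellms CLI evaluates --fail-on-changes.

```python
from dataclasses import dataclass, field


@dataclass
class DiffItem:
    probe_id: str
    changed: bool
    risk_tags: set = field(default_factory=set)


@dataclass
class Policy:
    max_changed: int = 0  # number of changed probes tolerated
    blocking_tags: frozenset = frozenset({"write", "transfer"})


def gate(items: list, policy: Policy) -> tuple:
    """Return (exit_code, reasons); a non-zero exit code fails the build."""
    reasons = []
    changed = [item for item in items if item.changed]
    if len(changed) > policy.max_changed:
        reasons.append(f"{len(changed)} changed probes exceed limit {policy.max_changed}")
    for item in changed:
        hits = sorted(item.risk_tags & policy.blocking_tags)
        if hits:
            reasons.append(f"{item.probe_id} carries blocking tags {hits}")
    return (1 if reasons else 0), reasons


# Hypothetical diff: one unchanged probe, one changed probe tagged "transfer".
code, reasons = gate(
    [DiffItem("p1", False), DiffItem("p2", True, {"transfer"})],
    Policy(),
)
```

Returning the reasons alongside the exit code keeps the policy decision itself available as an artefact for later review.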

Behavioural diff viewer

Diffs are not commentary; they are evidence. This pattern scales from one regression test to large harnesses.

The sample shows a regression that is semantically subtle but operationally significant: the change in the tool-call path is observable in the artefact.

Governance mapping and accountability

Run manifest

A manifest proves what was run and why. If the manifest is not signed, the run is incomplete evidence.

Inputs
prompt_hash=ae81b · model=llm-4.2 · temp=0 · tools=[retrieval,crm_lookup]
Signed @ 2026-02-13T10:22:18Z by k8s-runner-3

Environment lock details (seed, image, dependency lockfiles)
Tool boundary policy
Output schema and verifier rules
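
The signed-or-incomplete rule can be sketched as a verification step that runs before a manifest is accepted as evidence. HMAC-SHA256 is used here purely as a stand-in; the actual signing scheme (for example a KMS-held asymmetric key) is not specified in this document, and the function names are assumptions.

```python
import hashlib
import hmac


def sign(manifest_bytes: bytes, key: bytes) -> str:
    """Produce a hex signature over the manifest bytes (HMAC stand-in)."""
    return hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()


def is_complete_evidence(manifest_bytes: bytes, signature, key: bytes) -> bool:
    """Treat a run as evidence only if a valid signature is present."""
    if not signature:
        return False  # unsigned manifests are incomplete evidence
    expected = sign(manifest_bytes, key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)


# Hypothetical key and manifest bytes for illustration only.
key = b"demo-key-not-for-production"
manifest_bytes = b'{"model":"llm-4.2","temp":0}'
signature = sign(manifest_bytes, key)

ok = is_complete_evidence(manifest_bytes, signature, key)
tampered = is_complete_evidence(manifest_bytes + b" ", signature, key)
```

In this sketch ok is True and tampered is False, since any change to the manifest bytes invalidates the signature.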

Tool-augmented blast radius

Tools are part of behaviour. They need traceability as first-class events, not a side channel.

Tool inputs and return payload snapshots
External dependency versioning
Action class (read/write/transfer)
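
One way to treat tool calls as first-class events is to record each call with its version, action class, and payload snapshots. The record layout below is an assumption for illustration, as is the severity ordering read < write < transfer.

```python
from dataclasses import dataclass
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    WRITE = "write"
    TRANSFER = "transfer"


# Assumed severity ordering, lowest impact first.
SEVERITY = [ActionClass.READ, ActionClass.WRITE, ActionClass.TRANSFER]


@dataclass(frozen=True)
class ToolEvent:
    tool: str
    version: str          # external dependency version
    action: ActionClass
    inputs: dict          # snapshot of what the tool received
    output: dict          # snapshot of what the tool returned


def blast_radius(events: list):
    """Return the highest-impact action class seen in a run, or None."""
    return max((e.action for e in events), key=SEVERITY.index, default=None)


# Hypothetical trace of two tool calls.
events = [
    ToolEvent("retrieval", "1.4.0", ActionClass.READ, {"q": "refund policy"}, {"hits": 3}),
    ToolEvent("crm_lookup", "2.1.0", ActionClass.WRITE, {"id": 7}, {"updated": True}),
]
worst = blast_radius(events)
```

Here worst is ActionClass.WRITE, so a reviewer can see at a glance that the run did more than read data.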

The risk lens re-phrases evidence only; it does not change underlying truth.

Run timeline / trace explorer

Deterministic replay becomes trustworthy when the chain of events is inspectable.

Adoption path

1. Baseline lock

Capture stable baseline runs for your critical flows and publish a minimal baseline policy.

2. Enforce gates

Enable CI checks on high-risk suites and require explicit waiver flows.

3. Package evidence

Generate audit packs by environment and business domain for procurement, internal audit, and incident reviews.

Credibility strip

Chain of custody: manifest hash, signature, timestamp, and verifier key.

Audit-ready export: manifest + diff + timeline + policy decision bundle.

Determinism report: replay error budget, model/runtime versions, and tool-call inventory.
