Determinism and CI

insideLLMs is designed so the “run → records → report → diff” spine can be used for CI diff-gating.

flowchart LR
  Config[Resolved config + dataset identity] --> Run[Run / Harness]
  Run --> Records[records.jsonl]
  Records --> Validate[Validate]
  Validate --> Report[report.html]
  Records --> Diff[diff.json]
  Report --> Diff

What’s Deterministic?

When you run via insidellms run or insidellms harness, insideLLMs writes:

records.jsonl (canonical record stream)
manifest.json (run metadata)
summary.json
report.html

By default, the run_id and emitted timestamps are derived deterministically from the resolved config snapshot. You can override this with --run-id.

The deterministic contract is byte-for-byte: the same inputs and config produce identical artifacts. It rests on:

Canonical JSON emission for on-disk artifacts (stable key ordering and separators).
Stable ordering for plugin discovery and filesystem listings where relevant.
Idempotent reporting (insidellms report over the same run dir does not drift).
Resume safety: resumable runs validate the stored input fingerprint against the current prompt set, preventing mixed artifacts.
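As a rough illustration of the first item, canonical JSON emission can be sketched with the standard library alone. The helper name canonical_json is hypothetical, and the exact separators and encoding insideLLMs uses are assumptions here; the point is only that sorted keys plus fixed separators make equal data serialize to equal bytes.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators (hypothetical helper).

    Assumption: the real emitter may choose different separators or
    escaping; this only demonstrates the stable-ordering idea.
    """
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two dicts with the same content but different insertion order.
a = {"status": "ok", "score": 1.0, "custom": {"b": 2, "a": 1}}
b = {"custom": {"a": 1, "b": 2}, "score": 1.0, "status": "ok"}

# Both serialize to the same string, so on-disk bytes would match.
assert canonical_json(a) == canonical_json(b)
```

Without sort_keys, the output would follow dict insertion order, and two logically equal records could differ on disk.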

Volatile Fields (Intentionally Omitted)

Some values are inherently non-deterministic (wall-clock timing, shell invocation, host metadata). To keep the diff surface stable, canonical artifacts omit these:

ResultRecord.latency_ms is persisted as null.
manifest.json:command is persisted as null.

If you need timing/host details, use tracing/telemetry rather than the canonical CI artifacts.
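The effect of nulling a volatile field can be sketched as follows. This is a minimal illustration, not the library's serializer: the function to_canonical_line and the sample record keys input and output are hypothetical, and only latency_ms comes from the text above.

```python
import json

# Field named in the text above; everything else here is illustrative.
VOLATILE_RECORD_FIELDS = ("latency_ms",)


def to_canonical_line(record: dict) -> str:
    """Return a JSON line with volatile fields persisted as null."""
    cleaned = dict(record)  # shallow copy so the caller's record is untouched
    for field in VOLATILE_RECORD_FIELDS:
        if field in cleaned:
            cleaned[field] = None
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


# Same logical result measured with different wall-clock timings.
fast = {"input": "2+2", "output": "4", "latency_ms": 12.5}
slow = {"input": "2+2", "output": "4", "latency_ms": 840.1}

# After nulling, both runs emit identical lines, so a diff sees no change.
assert to_canonical_line(fast) == to_canonical_line(slow)
```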

Dataset Identity (`dataset_hash`)

For local file datasets (format: csv|jsonl), if a hash is not provided in the config, insideLLMs computes a deterministic dataset_hash (the SHA-256 of the file bytes, written as sha256:<digest>) and includes it in manifest.json. Because the hash feeds into run_id derivation, run_id changes whenever the dataset content changes.

flowchart TD
  Dataset[Dataset file bytes] --> Hash[dataset_hash = sha256(file_bytes)]
  Hash --> RunId[run_id derived from resolved config]
  RunId --> Time[Deterministic timestamps]
  RunId --> Dir[Run directory identity]
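The flow above can be approximated in a few lines of Python. This is a sketch under stated assumptions: compute_dataset_hash and derive_run_id are hypothetical names, the hex encoding of the digest is assumed, and the way insideLLMs actually folds the hash into run_id (including the 32-character truncation and the config keys used here) is not taken from the source.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def compute_dataset_hash(path, chunk_size=1 << 20) -> str:
    """Hash the raw file bytes in chunks; the hex encoding is an assumption."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def derive_run_id(resolved_config: dict) -> str:
    """Hypothetical derivation: hash of the canonical JSON of the config."""
    payload = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "harness_dataset.jsonl"
    dataset.write_bytes(b'{"question": "2+2"}\n')
    hash_v1 = compute_dataset_hash(dataset)
    dataset.write_bytes(b'{"question": "2+3"}\n')  # edit the dataset content
    hash_v2 = compute_dataset_hash(dataset)

# Illustrative config keys; the real resolved config schema may differ.
config = {"model": "dummy", "dataset": {"format": "jsonl"}}
run_id_v1 = derive_run_id({**config, "dataset_hash": hash_v1})
run_id_v2 = derive_run_id({**config, "dataset_hash": hash_v2})

# A content change alters the hash, which in turn alters the run_id.
assert hash_v1 != hash_v2 and run_id_v1 != run_id_v2
```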

CI Diff-Gating Pattern

1) Produce a baseline run dir.
2) Produce a candidate run dir using the exact same harness config.
3) Diff the two run dirs and fail the build if anything changes.

insidellms harness ci/harness.yaml --run-dir .tmp/runs/base --overwrite --skip-report
insidellms harness ci/harness.yaml --run-dir .tmp/runs/head --overwrite --skip-report
insidellms diff .tmp/runs/base .tmp/runs/head --fail-on-changes
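When debugging a failing gate, it can help to check byte-for-byte equality of the artifacts directly. The sketch below is a standalone sanity check and not a replacement for insidellms diff; changed_artifacts is a hypothetical helper, and comparing only records.jsonl by default is an assumption about which artifact matters most.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path):
    """SHA-256 of a file's bytes, or None when the file is absent."""
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None


def changed_artifacts(base_dir, head_dir, names=("records.jsonl",)):
    """List artifact names whose bytes differ between two run dirs.

    An artifact missing from both dirs counts as unchanged.
    """
    base, head = Path(base_dir), Path(head_dir)
    return [n for n in names if file_digest(base / n) != file_digest(head / n)]


# Demo with throwaway directories standing in for .tmp/runs/base and head.
with tempfile.TemporaryDirectory() as tmp:
    base, head = Path(tmp, "base"), Path(tmp, "head")
    for run_dir in (base, head):
        run_dir.mkdir()
        (run_dir / "records.jsonl").write_bytes(b'{"output":"4"}\n')
    identical = changed_artifacts(base, head)  # nothing differs yet

    (head / "records.jsonl").write_bytes(b'{"output":"5"}\n')
    drifted = changed_artifacts(base, head)  # now the record stream differs
```

Pass additional names (for example manifest.json or summary.json) to widen the comparison.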

Recommended CI Harness

Use the repo’s minimal, offline harness:

Config: ci/harness.yaml
Dataset: ci/harness_dataset.jsonl

It uses DummyModel only (no API keys) and probes that accept dict inputs.

Useful Diff Flags

--fail-on-regressions: fail only on score/status regressions
--fail-on-changes: fail on any difference (including additions/removals)
--fail-on-trace-drift: fail if trace fingerprints drift (when enabled)
--fail-on-trace-violations: fail if contract violations increase (when enabled)
--output-fingerprint-ignore: ignore volatile keys when fingerprinting structured outputs

Trace-Aware Diffing (Optional)

Trace flags only apply when records include trace data:

custom.trace.fingerprint.value or custom.trace_fingerprint
custom.trace.violations or custom.trace_violations

If these fields are missing, trace drift and trace violation checks are no-ops.
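To check whether your records carry trace data before enabling the trace flags, a lookup along these lines may help. It is a sketch with assumptions: the helper names are hypothetical, custom is assumed to be a top-level key on each record, and preferring the nested path over the flat key is a guess, not documented behavior.

```python
def dig(mapping, *keys):
    """Follow nested dict keys, returning None if any step is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def trace_fingerprint(record: dict):
    """Look up custom.trace.fingerprint.value, then custom.trace_fingerprint."""
    nested = dig(record, "custom", "trace", "fingerprint", "value")
    return nested if nested is not None else dig(record, "custom", "trace_fingerprint")


def trace_violations(record: dict):
    """Look up custom.trace.violations, then custom.trace_violations."""
    nested = dig(record, "custom", "trace", "violations")
    return nested if nested is not None else dig(record, "custom", "trace_violations")


nested_record = {"custom": {"trace": {"fingerprint": {"value": "abc123"}, "violations": []}}}
flat_record = {"custom": {"trace_fingerprint": "def456", "trace_violations": ["v1"]}}
plain_record = {"output": "4"}  # no trace data, so trace checks would be no-ops
```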

For structured outputs with volatile fields, use:

insidellms diff ... --output-fingerprint-ignore timestamp,request_id
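The idea behind ignoring volatile keys can be sketched as follows. The function fingerprint_output is hypothetical, and whether the real flag matches keys at every nesting level (as this sketch does) or only at the top level is not stated in the source.

```python
import hashlib
import json


def fingerprint_output(output, ignore_keys=()) -> str:
    """Hash a structured output after dropping the ignored keys.

    Assumption: ignored keys are removed at every nesting level.
    """
    ignored = set(ignore_keys)

    def scrub(value):
        if isinstance(value, dict):
            return {k: scrub(v) for k, v in value.items() if k not in ignored}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value

    payload = json.dumps(scrub(output), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ignore = ("timestamp", "request_id")
first = {"answer": "4", "timestamp": "2024-01-01T00:00:00Z", "request_id": "a1"}
second = {"answer": "4", "timestamp": "2024-06-30T12:00:00Z", "request_id": "b2"}
changed = {"answer": "5", "timestamp": "2024-06-30T12:00:00Z", "request_id": "b2"}

# Only the volatile keys differ, so the fingerprints match.
assert fingerprint_output(first, ignore) == fingerprint_output(second, ignore)
# A real change in the answer still shows up.
assert fingerprint_output(first, ignore) != fingerprint_output(changed, ignore)
```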

See Tracing and Fingerprinting for details.