Artifacts

insideLLMs produces structured artifacts for analysis and CI integration.

Overview

graph TD
    Run[Runner Execution] --> Records[records.jsonl]
    Run --> Manifest[manifest.json]
    Run --> Config[config.resolved.yaml]
    
    Records --> Summary[summary.json]
    Records --> Report[report.html]
    
    Records --> Diff[diff.json]
    Baseline[Baseline records] --> Diff

Artifact Types

Artifact              Purpose           When Created
records.jsonl         Raw results       During run
manifest.json         Run metadata      After completion
config.resolved.yaml  Full config       Start of run
summary.json          Aggregated stats  After completion
report.html           Human report      On request
diff.json             Run comparison    Via insidellms diff

records.jsonl

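The canonical output, with one JSON object per line and one line per result. The record below is pretty-printed for readability; in the file it occupies a single line: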

{
  "schema_version": "1.0.0",
  "run_id": "a1b2c3d4...",
  "started_at": "2009-03-14T15:09:26.535897+00:00",
  "completed_at": "2009-03-14T15:09:26.535898+00:00",
  "model": {
    "model_id": "gpt-4o",
    "provider": "openai"
  },
  "probe": {
    "probe_id": "logic"
  },
  "dataset": {
    "dataset_id": "test.jsonl",
    "dataset_hash": "sha256:abc123..."
  },
  "example_id": "0",
  "input": {"question": "What is 2 + 2?"},
  "output": "4",
  "status": "success",
  "error": null,
  "error_type": null
}

Key Fields

Field           Description
schema_version  Artifact schema version
run_id          Deterministic run identifier
started_at      Deterministic timestamp
model           Model specification
probe           Probe specification
example_id      Input identifier
input           Original input data
output          Model/probe output
status          "success" or "error"
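
As a rough illustration of how these fields combine, the sketch below tallies records by status and collects the example_id of each failed record. The helper name summarize_status is hypothetical and not part of insideLLMs; it relies only on the status and example_id fields shown above.

```python
import json
from collections import Counter


def summarize_status(lines):
    """Count records per status and collect example_ids of failed records.

    `lines` is any iterable of JSON strings, such as an open records.jsonl
    file. Hypothetical helper: not part of the insideLLMs API.
    """
    counts = Counter()
    failed_ids = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record["status"]] += 1
        if record["status"] == "error":
            failed_ids.append(record["example_id"])
    return counts, failed_ids
```

Because iterating over a file yields its lines, an open handle on run_dir/records.jsonl can be passed in directly.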

manifest.json

Run-level metadata:

{
  "schema_version": "1.0.1",
  "run_id": "a1b2c3d4...",
  "created_at": "2009-03-14T15:09:26.535897+00:00",
  "started_at": "2009-03-14T15:09:26.535897+00:00",
  "completed_at": "2009-03-14T15:09:26.535899+00:00",
  "run_completed": true,
  "library_version": "0.1.0",
  "python_version": "3.11.0",
  "platform": "macOS-14.0-arm64",
  "model": {...},
  "probe": {...},
  "dataset": {...},
  "record_count": 100,
  "success_count": 98,
  "error_count": 2,
  "records_file": "records.jsonl"
}
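
A manifest can serve as a quick sanity check before any deeper analysis. The sketch below is one possible check, not something insideLLMs provides: check_manifest is a hypothetical name, and the rule that record_count equals success_count + error_count is an assumption inferred from the example above (98 + 2 = 100).

```python
def check_manifest(manifest):
    """Return a list of problems found in a parsed manifest.json dict.

    Assumption (not confirmed by the docs): record_count should equal
    success_count + error_count, as it does in the example manifest.
    """
    problems = []
    # A missing flag is treated the same as false.
    if manifest.get("run_completed") is not True:
        problems.append("run_completed is not true (missing or false)")
    total = manifest["success_count"] + manifest["error_count"]
    if total != manifest["record_count"]:
        problems.append(
            f"record_count is {manifest['record_count']}, "
            f"but success_count + error_count is {total}"
        )
    return problems
```

An empty list means the manifest passed both checks.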

config.resolved.yaml

The fully resolved configuration:

model:
  type: openai
  args:
    model_name: gpt-4o
    temperature: 0.7
probe:
  type: logic
  args: {}
dataset:
  format: jsonl
  path: /absolute/path/to/data.jsonl
  dataset_hash: sha256:abc123...

Useful for:

  • Reproducing runs exactly
  • Debugging path resolution
  • Auditing configurations
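
When reproducing a run, it can help to confirm that the dataset on disk still matches the recorded dataset_hash. The sketch below assumes the hash is the SHA-256 of the raw file bytes with a "sha256:" prefix; the documentation does not say how the hash is computed, so treat that as a guess. Both function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return 'sha256:<hex digest>' for the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def dataset_matches(path, recorded_hash):
    """Compare a dataset file against the hash recorded in the config.

    Assumes the recorded hash covers the raw file bytes (unverified).
    """
    return file_sha256(path) == recorded_hash
```

If the comparison fails on an unmodified file, the hashing scheme assumed here probably differs from the one insideLLMs uses.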

summary.json

Aggregated statistics:

{
  "schema_version": "1.0.0",
  "run_id": "a1b2c3d4...",
  "models": {
    "gpt-4o": {
      "success_rate": 0.98,
      "example_count": 100,
      "error_count": 2
    }
  },
  "probes": {
    "logic": {
      "success_rate": 0.98
    }
  },
  "overall": {
    "success_rate": 0.98,
    "total_examples": 100
  }
}
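
The per-model numbers can be cross-checked against records.jsonl. The sketch below recomputes them from parsed records, assuming success_rate is (example_count - error_count) / example_count, which agrees with the example above (98 / 100 = 0.98). The function name is hypothetical, and the actual aggregation in insideLLMs may differ.

```python
from collections import defaultdict


def success_rates_by_model(records):
    """Compute per-model counts and success rate from parsed records.

    Assumption: any record whose status is "error" counts as a failure,
    and every other record counts as a success.
    """
    totals = defaultdict(lambda: {"example_count": 0, "error_count": 0})
    for record in records:
        stats = totals[record["model"]["model_id"]]
        stats["example_count"] += 1
        if record["status"] == "error":
            stats["error_count"] += 1

    result = {}
    for model_id, stats in totals.items():
        successes = stats["example_count"] - stats["error_count"]
        result[model_id] = {
            "success_rate": successes / stats["example_count"],
            "example_count": stats["example_count"],
            "error_count": stats["error_count"],
        }
    return result
```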

report.html

Standalone HTML report with:

  • Model comparison tables
  • Success/failure breakdown
  • Individual response viewer
  • Filtering and search

Open directly in any browser.


diff.json

Comparison between two runs:

{
  "baseline_run_id": "abc123...",
  "candidate_run_id": "def456...",
  "baseline_path": "/path/to/baseline",
  "candidate_path": "/path/to/candidate",
  "changes": [
    {
      "example_id": "42",
      "field": "output",
      "baseline": "The answer is 4",
      "candidate": "The answer is four"
    }
  ],
  "summary": {
    "total_examples": 100,
    "changed": 3,
    "unchanged": 97,
    "added": 0,
    "removed": 0
  }
}
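
To make the structure above concrete, the sketch below builds a similar changes list and summary from two lists of parsed records matched by example_id. It is an illustration only, not the implementation behind insidellms diff: the function name is hypothetical, only one field is compared, and total_examples is assumed to count every example_id seen in either run.

```python
def diff_outputs(baseline, candidate, field="output"):
    """Compare two lists of parsed records, matched by example_id."""
    base = {r["example_id"]: r for r in baseline}
    cand = {r["example_id"]: r for r in candidate}
    shared = base.keys() & cand.keys()

    changes = []
    for example_id in sorted(shared):
        old = base[example_id].get(field)
        new = cand[example_id].get(field)
        if old != new:
            changes.append({
                "example_id": example_id,
                "field": field,
                "baseline": old,
                "candidate": new,
            })

    summary = {
        # Assumption: the total covers the union of both runs.
        "total_examples": len(base.keys() | cand.keys()),
        "changed": len(changes),
        "unchanged": len(shared) - len(changes),
        "added": len(cand.keys() - base.keys()),
        "removed": len(base.keys() - cand.keys()),
    }
    return {"changes": changes, "summary": summary}
```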

Working with Artifacts

Reading Records

import json

# Parse one JSON object per non-empty line
with open("run_dir/records.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

Generating Summary

insidellms report ./run_dir --summary-only

Generating HTML Report

insidellms report ./run_dir
# Creates ./run_dir/report.html

Comparing Runs

insidellms diff ./baseline ./candidate
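
For CI integration, a saved diff.json can gate a pipeline. The sketch below returns a non-zero exit code when the summary reports any changed, added, or removed examples. Where the diff command writes its output is not covered here, so the path is passed in explicitly; diff_is_clean and main are hypothetical names.

```python
import json
import sys


def diff_is_clean(diff):
    """Return True when a parsed diff.json reports no differences."""
    summary = diff["summary"]
    return all(summary[key] == 0 for key in ("changed", "added", "removed"))


def main(diff_path):
    """Return 0 if the diff at `diff_path` is clean, 1 otherwise."""
    with open(diff_path, encoding="utf-8") as f:
        diff = json.load(f)
    if diff_is_clean(diff):
        return 0
    print(f"Run differs from baseline: {diff['summary']}", file=sys.stderr)
    return 1
```

A CI script could call sys.exit(main(path)) so that a differing run fails the job.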

Schema Versions

Artifacts are versioned:

Version  Changes
1.0.0    Initial schema
1.0.1    Added run_completed flag

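Check the version of a record:

import json
with open("run_dir/records.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())  # first record in the file
version = record["schema_version"]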
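
Version strings should be compared numerically rather than as text, since "1.0.10" sorts before "1.0.9" as a string. The sketch below parses a version into a tuple of integers and uses it to decide whether the run_completed flag can be expected. Both function names are hypothetical, and the parser assumes plain MAJOR.MINOR.PATCH strings with no suffixes.

```python
def parse_version(version):
    """Turn a 'MAJOR.MINOR.PATCH' string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def has_run_completed_flag(schema_version):
    """True if the schema is 1.0.1 or newer, where run_completed was added."""
    return parse_version(schema_version) >= (1, 0, 1)
```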

Determinism

All artifacts are deterministic:

  • Sorted JSON keys
  • Consistent separators
  • Timestamps derived from run_id
  • Stable record ordering
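
The first two points can be illustrated with the standard json module. The sketch below is an assumption about the general technique, not the serializer insideLLMs ships: the compact separators are a guess, and canonical_json is a hypothetical name.

```python
import json


def canonical_json(obj):
    """Serialize `obj` with sorted keys and fixed, compact separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
```

With this approach, two dicts holding the same data serialize to the same string regardless of key insertion order.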

See Determinism for details.


See Also