CI Integration Tutorial
Block regressions automatically.
Time: 30 minutes
Prerequisites: Git, GitHub Actions
Step 1: Create a Baseline
First, create a deterministic baseline run using DummyModel:
# Create a harness config for CI
mkdir -p ci
cat > ci/harness.yaml << 'EOF'
models:
  - type: dummy
    args:
      name: baseline_model
probes:
  - type: logic
dataset:
  format: jsonl
  path: ci/dataset.jsonl
EOF
# Run the baseline
insidellms harness ci/harness.yaml --run-dir ci/baseline --overwrite --skip-report
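The config above points at ci/dataset.jsonl, which must exist before the harness runs. A minimal pre-flight sketch follows. It assumes only that the dataset is a non-empty file with one record per line; the record schema expected by the logic probe is not covered here, and `check_dataset` is a hypothetical helper, not part of insidellms.

```shell
# Hypothetical helper (not an insidellms command): fail early if the
# dataset file is missing, empty, or contains only blank lines.
check_dataset() {
  dataset_path="$1"
  if [ ! -s "$dataset_path" ]; then
    echo "dataset missing or empty: $dataset_path" >&2
    return 1
  fi
  # Count non-blank lines; each one is expected to hold a single JSONL record.
  record_count=$(grep -c '[^[:space:]]' "$dataset_path")
  if [ "$record_count" -eq 0 ]; then
    echo "dataset has no records: $dataset_path" >&2
    return 1
  fi
  echo "$dataset_path: $record_count record(s)"
}

# Usage (before the baseline run):
#   check_dataset ci/dataset.jsonl || exit 1
```

This only checks that the file is present and non-blank. It does not validate that each line is well-formed JSON.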
Step 2: Verify Determinism
Run again and diff to confirm identical outputs:
insidellms harness ci/harness.yaml --run-dir ci/candidate --overwrite --skip-report
insidellms diff ci/baseline ci/candidate
Expected output:
Comparing runs...
Baseline: ci/baseline
Candidate: ci/candidate
Changes: 0
Status: IDENTICAL
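As a cross-check that does not depend on `insidellms diff`, a byte-level comparison of two run directories can be sketched with standard tools. This assumes every file in a run directory is deterministic, which this tutorial does not state. `runs_identical` is a hypothetical helper name, and `records.jsonl` is a placeholder file used only for the demo.

```shell
# Hypothetical helper: succeed only if both directories have identical contents.
runs_identical() {
  diff -r -q "$1" "$2" >/dev/null
}

# Demo on throwaway directories standing in for ci/baseline and ci/candidate.
demo=$(mktemp -d)
mkdir -p "$demo/baseline" "$demo/candidate"
printf '{"output":"ok"}\n' > "$demo/baseline/records.jsonl"
cp "$demo/baseline/records.jsonl" "$demo/candidate/records.jsonl"

if runs_identical "$demo/baseline" "$demo/candidate"; then
  echo "IDENTICAL"
else
  echo "DIFFERENT"
fi
```

With real runs, the call would be `runs_identical ci/baseline ci/candidate`. A byte-level check is stricter than a behavioural diff, so a mismatch here is a prompt to inspect the files, not proof of a regression.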
Step 3: Commit the Baseline
git add ci/
git commit -m "Add CI baseline for behavioural testing"
Step 4: Create GitHub Actions Workflow
Create .github/workflows/behavioural-test.yml:
name: Behavioural Tests
on:
  pull_request:
    branches: [main]
jobs:
  behavioural-diff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: dr-gareth-roberts/insideLLMs@v1
        with:
          harness-config: ci/harness.yaml
This action runs the harness on the PR branch and the PR base ref, generates a deterministic diff.json, and upserts a sticky pull-request comment with top regressions and changes.
Step 5: Test the Workflow
Push to trigger the workflow:
git add .github/
git commit -m "Add behavioural testing workflow"
git push
The workflow should pass (no changes detected).
Step 6: Simulate a Regression
To test that the CI catches changes, modify something that affects outputs:
# Modify the harness so the dummy model returns a different response
cat > ci/harness.yaml << 'EOF'
models:
  - type: dummy
    args:
      name: baseline_model
      canned_response: "CHANGED RESPONSE" # This changes outputs!
probes:
  - type: logic
dataset:
  format: jsonl
  path: ci/dataset.jsonl
EOF
Push this change. The CI run should now fail and produce a diff report.
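A GitHub Actions job fails when any step exits with a non-zero status. The sketch below shows that gating pattern in isolation. It assumes the `--fail-on-changes` flag used later in this tutorial makes `insidellms diff` exit non-zero when changes are found, which the tutorial implies but does not state. `run_gate` is a hypothetical wrapper, and `true`/`false` stand in for the real command so the example runs without insidellms installed.

```shell
# Hypothetical wrapper: run a gate command and report pass/fail from its exit status.
run_gate() {
  if "$@"; then
    echo "gate passed"
  else
    status=$?   # exit status of the command that was just run
    echo "gate failed (exit $status)" >&2
    return "$status"
  fi
}

# In CI this would wrap the diff command, for example:
#   run_gate insidellms diff ci/baseline ci/candidate --fail-on-changes
# Stand-in commands show both outcomes:
run_gate true
run_gate false || echo "job would stop here"
```

The wrapper passes the original exit status through unchanged, so the CI runner still sees the failure.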
Advanced Options
Ignore Specific Fields
Some fields, such as latency and timestamps, vary between runs by nature. Exclude them from the output fingerprint:
insidellms diff ci/baseline ci/candidate \
--output-fingerprint-ignore latency_ms,timestamps \
--fail-on-changes
Trace and Trajectory Gates
Use dedicated diff gates for trace contracts and agent/tool trajectories:
insidellms diff ci/baseline ci/candidate \
--fail-on-trace-drift \
--fail-on-trace-violations
insidellms diff ci/baseline ci/candidate \
--fail-on-trajectory-drift
Judge-Assisted Triage
Layer deterministic judge triage on top of the same core diff computation:
insidellms diff ci/baseline ci/candidate \
--judge \
--judge-policy balanced \
--judge-limit 50 \
--fail-on-trace-violations
Update Baseline
When changes are intentional:
# Run new baseline
insidellms harness ci/harness.yaml --run-dir ci/baseline --overwrite --skip-report
# Commit the update
git add ci/baseline
git commit -m "Update behavioural baseline: [describe changes]"
Complete Workflow Example
name: Behavioural Tests
on:
  push:
    branches: [main]
  pull_request:
env:
  INSIDELLMS_RUN_ROOT: .tmp/runs
jobs:
  behavioural-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install -e ".[all]"
      - name: Run candidate
        run: |
          insidellms harness ci/harness.yaml \
            --run-dir ${{ env.INSIDELLMS_RUN_ROOT }}/candidate \
            --overwrite --skip-report
      - name: Compare to baseline
        run: |
          insidellms diff \
            ci/baseline \
            ${{ env.INSIDELLMS_RUN_ROOT }}/candidate \
            --fail-on-changes \
            --output diff-report.json
      - name: Upload diff on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: behavioural-diff-$
          path: diff-report.json
          retention-days: 7
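The "Run candidate" and "Compare to baseline" steps can also be run locally before pushing. The sketch below wraps the same two commands in a function. It assumes insidellms is installed and ci/baseline is checked in; `local_ci` is a hypothetical name, and the `.tmp/runs` fallback mirrors the `INSIDELLMS_RUN_ROOT` value in the workflow.

```shell
# Hypothetical local equivalent of the two CI steps above.
# Uses only commands and flags that already appear in this tutorial.
local_ci() {
  run_root="${INSIDELLMS_RUN_ROOT:-.tmp/runs}"

  # Step 1: produce a fresh candidate run; stop if the harness itself fails.
  insidellms harness ci/harness.yaml \
    --run-dir "$run_root/candidate" \
    --overwrite --skip-report || return $?

  # Step 2: compare against the committed baseline; this command's
  # exit status becomes the function's exit status.
  insidellms diff ci/baseline "$run_root/candidate" \
    --fail-on-changes \
    --output diff-report.json
}

# Usage:
#   local_ci && echo "no behavioural changes"
```

Defining the steps as a function keeps the local check and the workflow easy to compare line by line.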
Verification
- Baseline committed to repository
- GitHub Actions workflow created
- CI passes with no changes
- CI fails when outputs change
What’s Next?
- Determinism and CI - Understand why this works
- Tracing and Fingerprinting - Advanced diff features
- Troubleshooting - Common CI issues