CI Integration Tutorial
Block regressions automatically.
Time: 30 minutes Prerequisites: Git, GitHub Actions
Step 1: Create a Baseline
First, create a deterministic baseline run using DummyModel:
# Create a harness config for CI
mkdir -p ci
cat > ci/harness.yaml << 'EOF'
models:
- type: dummy
args:
name: baseline_model
probes:
- type: logic
dataset:
format: inline
items:
- question: "What is 2 + 2?"
- question: "Is the sky blue?"
- question: "What comes next: 1, 2, 3, ?"
EOF
# Run the baseline
insidellms harness ci/harness.yaml --run-dir ci/baseline --overwrite --skip-report
Step 2: Verify Determinism
Run again and diff to confirm identical outputs:
insidellms harness ci/harness.yaml --run-dir ci/candidate --overwrite --skip-report
insidellms diff ci/baseline ci/candidate
Expected output:
Comparing runs...
Baseline: ci/baseline
Candidate: ci/candidate
Changes: 0
Status: IDENTICAL
Step 3: Commit the Baseline
git add ci/
git commit -m "Add CI baseline for behavioural testing"
Step 4: Create GitHub Actions Workflow
Create .github/workflows/behavioural-test.yml:
name: Behavioural Tests
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
behavioural-diff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install insideLLMs
run: pip install -e ".[all]"
- name: Run candidate harness
run: |
insidellms harness ci/harness.yaml \
--run-dir ci/candidate \
--overwrite \
--skip-report
- name: Diff against baseline
run: |
insidellms diff ci/baseline ci/candidate \
--fail-on-changes \
--output ci/diff-report.json
- name: Upload diff report
if: failure()
uses: actions/upload-artifact@v4
with:
name: behavioural-diff
path: ci/diff-report.json
Step 5: Test the Workflow
Push to trigger the workflow:
git add .github/
git commit -m "Add behavioural testing workflow"
git push
The workflow should pass (no changes detected).
Step 6: Simulate a Regression
To test that the CI catches changes, modify something that affects outputs:
# Modify the harness to use a different probe
cat > ci/harness.yaml << 'EOF'
models:
- type: dummy
args:
name: baseline_model
response: "CHANGED RESPONSE" # This changes outputs!
probes:
- type: logic
dataset:
format: inline
items:
- question: "What is 2 + 2?"
- question: "Is the sky blue?"
- question: "What comes next: 1, 2, 3, ?"
EOF
Push this change — the CI will fail with a diff report.
Advanced Options
Ignore Specific Fields
Some fields are intentionally volatile. Ignore them:
insidellms diff ci/baseline ci/candidate \
--ignore-fields latency_ms,timestamps \
--fail-on-changes
Trace-Aware Diffing
For structured outputs with fingerprints:
insidellms diff ci/baseline ci/candidate \
--trace-aware \
--fail-on-trace-violations
Update Baseline
When changes are intentional:
# Run new baseline
insidellms harness ci/harness.yaml --run-dir ci/baseline --overwrite --skip-report
# Commit the update
git add ci/baseline
git commit -m "Update behavioural baseline: [describe changes]"
Complete Workflow Example
name: Behavioural Tests
on:
push:
branches: [main]
pull_request:
env:
INSIDELLMS_RUN_ROOT: .tmp/runs
jobs:
behavioural-diff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'
- run: pip install -e ".[all]"
- name: Run candidate
run: |
insidellms harness ci/harness.yaml \
--run-dir $/candidate \
--overwrite --skip-report
- name: Compare to baseline
run: |
insidellms diff \
ci/baseline \
$/candidate \
--fail-on-changes \
--output diff-report.json
- name: Upload diff on failure
if: failure()
uses: actions/upload-artifact@v4
with:
name: behavioural-diff-$
path: diff-report.json
retention-days: 7
Verification
Baseline committed to repository GitHub Actions workflow created CI passes with no changes CI fails when outputs change
What’s Next?
- Determinism and CI - Understand why this works
- Tracing and Fingerprinting - Advanced diff features
- Troubleshooting - Common CI issues