# First Harness

10 minutes. Compare two models.

The harness runs identical tests across multiple models. Output: side-by-side comparison.
## Config

```yaml
# my_harness.yaml
models:
  - type: dummy
    args: {name: baseline}
  - type: dummy
    args: {name: candidate}

probes:
  - type: logic

dataset:
  format: inline
  items:
    - question: "What is 2 + 2?"
    - question: "If A > B and B > C, is A > C?"

output_dir: ./harness_results
```
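Before running anything, you can sanity-check the cross product this config implies. A minimal sketch using PyYAML (`pip install pyyaml`); it touches no insidellms API, just the file above:

```python
# Sanity-check the models x examples cross product a harness config implies.
# Assumes only PyYAML; this does not use any insidellms API.
import yaml

with open("my_harness.yaml") as f:
    cfg = yaml.safe_load(f)

n_models = len(cfg["models"])
n_probes = len(cfg["probes"])
n_items = len(cfg["dataset"]["items"])

# With a single probe, expected records = models x examples: 2 x 2 = 4 here.
print(f"{n_models} models x {n_probes} probe(s) x {n_items} items")
print(f"expected records: {n_models * n_probes * n_items}")
```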
## Run

```bash
insidellms harness my_harness.yaml
# Creates: records.jsonl (4 records), summary.json, report.html
```
## View Results

```bash
# Raw records
wc -l harness_results/records.jsonl
# 4 (2 models × 2 examples)

# HTML report
open harness_results/report.html
```
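For a quick side-by-side in the terminal without opening the HTML report, you can group the raw records by model. The field names below (`model`, `question`, `output`) are assumptions about the record schema, not documented insidellms fields; check one record with `head -1 harness_results/records.jsonl` and adjust:

```python
# Quick terminal view of records.jsonl, grouped by model.
# Field names ("model", "question", "output") are assumptions;
# inspect one record first and adjust to the actual schema.
import json
from collections import defaultdict

by_model = defaultdict(list)
with open("harness_results/records.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        by_model[rec["model"]].append(rec)

for model, recs in by_model.items():
    print(f"== {model} ==")
    for rec in recs:
        print(f"  {rec['question']!r} -> {rec['output']!r}")
```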
## Real Models

```yaml
models:
  - type: openai
    args: {model_name: gpt-4o}
  - type: anthropic
    args: {model_name: claude-3-5-sonnet-20241022}

probes:
  - type: logic
  - type: bias

dataset:
  format: jsonl
  path: data/test.jsonl
```
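`data/test.jsonl` holds one JSON object per line. Assuming it uses the same `question` key as the inline dataset above, a minimal two-line file might look like:

```jsonl
{"question": "What is 2 + 2?"}
{"question": "If A > B and B > C, is A > C?"}
```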
```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

insidellms harness real_harness.yaml
```
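If a real run fails, it helps to rule out key problems first. A standalone sketch that exercises both keys directly with the official `openai` and `anthropic` Python SDKs (`pip install openai anthropic`); this is not part of insidellms:

```python
# Standalone sanity check that both API keys work. Not insidellms code;
# uses the official SDKs, which read the keys from the environment.
from openai import OpenAI
import anthropic

openai_client = OpenAI()  # reads OPENAI_API_KEY
resp = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print("openai:", resp.choices[0].message.content)

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
msg = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=64,
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print("anthropic:", msg.content[0].text)
```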
## Common Options

```bash
--async --concurrency 10   # Parallel execution
--max-examples 50          # Limit the dataset
--overwrite                # Replace existing results
```
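Assuming the options compose (worth confirming with `insidellms harness --help`), a combined run might look like:

```bash
insidellms harness real_harness.yaml --async --concurrency 10 --max-examples 50 --overwrite
```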
## Next

Understanding Outputs → Learn what each artefact contains.