# Datasets
Datasets provide the inputs that probes use to test models.
## Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| JSONL | .jsonl | Structured data with fields |
| CSV | .csv | Tabular data |
| Inline | (in config) | Small test sets |
| HuggingFace | (remote) | Standard benchmarks |
## JSONL Format

The most flexible format: one JSON object per line.

```jsonl
{"question": "What is 2 + 2?", "expected": "4"}
{"question": "What colour is the sky?", "expected": "blue"}
{"question": "Name a prime number", "expected": "2"}
```
### Config

```yaml
dataset:
  format: jsonl
  path: data/test.jsonl
```
### Loading Programmatically

```python
from insideLLMs.dataset_utils import load_jsonl_dataset

items = load_jsonl_dataset("data/test.jsonl")
```
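Assuming each loaded item is a plain dict mirroring the JSONL fields (check the actual return type for your version), you can iterate over the items directly:

```python
for item in items:
    print(item["question"], "->", item["expected"])
```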
## CSV Format

For tabular data with a header row:

```csv
question,expected
"What is 2 + 2?","4"
"What colour is the sky?","blue"
```
### Config

```yaml
dataset:
  format: csv
  path: data/test.csv
  columns:
    question: question
    expected: expected
```
### Loading Programmatically

```python
from insideLLMs.dataset_utils import load_csv_dataset

items = load_csv_dataset("data/test.csv")
```
## Inline Format

For small datasets defined directly in the config:

```yaml
dataset:
  format: inline
  items:
    - question: "What is 2 + 2?"
      expected: "4"
    - question: "What colour is the sky?"
      expected: "blue"
```
## HuggingFace Datasets

Load standard benchmarks:

```yaml
dataset:
  format: huggingface
  name: cais/mmlu
  split: test
  subset: abstract_algebra
  max_examples: 100
```
### Loading Programmatically

```python
from insideLLMs.dataset_utils import load_huggingface_dataset

items = load_huggingface_dataset(
    name="cais/mmlu",
    split="test",
    subset="abstract_algebra",
)
```
## Content Hashing

Local datasets are content-addressed:

```yaml
dataset:
  format: jsonl
  path: data/test.jsonl
  # Automatically computed:
  dataset_hash: sha256:abc123def456...
```

The hash is included in the `run_id`, ensuring:

- Different data → different `run_id`
- Same data → same `run_id` (determinism)
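Conceptually, the hash is a SHA-256 digest over the dataset file's bytes. A minimal sketch of how such a content hash can be computed (the function name is illustrative, not part of the insideLLMs API):

```python
import hashlib

def content_hash(path: str) -> str:
    """Hash a dataset file's bytes; streaming keeps memory use constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```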
## Path Resolution

Relative paths resolve from the config file’s directory:

```
project/
├── configs/
│   └── harness.yaml   # dataset.path: ../data/test.jsonl
└── data/
    └── test.jsonl     # ← Resolved path
```
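In code, this resolution amounts to joining the dataset path onto the config file's parent directory rather than the current working directory. A sketch (the helper name is hypothetical):

```python
from pathlib import Path

def resolve_dataset_path(config_path: str, dataset_path: str) -> Path:
    """Resolve dataset_path relative to the config file, not the CWD."""
    p = Path(dataset_path)
    return p if p.is_absolute() else (Path(config_path).parent / p).resolve()
```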
## Limiting Examples

For development and testing:

```yaml
dataset:
  format: jsonl
  path: data/large_dataset.jsonl
  max_examples: 50  # Only use the first 50
```

Or via the CLI:

```bash
insidellms run config.yaml --max-examples 50
```
## Dataset Registry

Register custom loaders for formats that aren't built in:

```python
from insideLLMs.registry import dataset_registry

def load_my_format(path, **kwargs):
    # Custom loading logic: parse the file into a list of items
    items = []
    with open(path) as f:
        for line in f:
            items.append({"question": line.strip()})
    return items

dataset_registry.register("my_format", load_my_format)
```
Then in config:

```yaml
dataset:
  format: my_format
  path: data/custom.xyz
```
## Input Structure

Probes expect specific input structures:

### Simple String

```json
"What is the capital of France?"
```

### Dict with Fields

```json
{"question": "...", "expected": "..."}
{"prompt": "...", "constraints": [...]}
```

### Chat Messages

```json
{"messages": [{"role": "user", "content": "Hello!"}]}
```
Check the Probes Catalog for each probe’s expected format.
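For illustration, a probe that accepts any of these shapes might normalize them into chat messages like this (a sketch under that assumption; actual probes in insideLLMs may handle inputs differently):

```python
def to_messages(item):
    """Coerce any supported input shape into a chat-message list."""
    if isinstance(item, str):                      # simple string
        return [{"role": "user", "content": item}]
    if "messages" in item:                         # already chat-formatted
        return item["messages"]
    # dict with fields: fall back to the question/prompt text
    text = item.get("question") or item.get("prompt")
    return [{"role": "user", "content": text}]
```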
## Best Practices

### Do

- Use JSONL for structured data
- Include `expected` for evaluation
- Use meaningful field names
- Keep datasets version-controlled

### Don’t

- Include sensitive data
- Use very large inline datasets
- Rely on file modification times
- Use absolute paths in configs
## See Also

- Configuration Reference - Dataset config options
- Probes Catalog - Expected input formats
- Determinism - How dataset hashing works