Experiment Tracking

Log runs to external tracking systems for visualisation and comparison.

Supported Backends

Backend             Use Case
Local File          Simple file-based logging
Weights & Biases    Team collaboration, dashboards
MLflow              Model lifecycle management
TensorBoard         TensorFlow ecosystem

Enabling Tracking

In Config

tracking:
  enabled: true
  backend: wandb
  project: my-llm-evaluation
  tags:
    - experiment
    - v1

Programmatically

from insideLLMs.tracking import get_tracker

tracker = get_tracker("wandb", project="my-llm-evaluation")

with tracker.start_run(name="bias-test"):
    # Run your experiment
    results = runner.run(prompt_set)
    
    # Log metrics
    tracker.log_metrics({
        "success_rate": runner.success_rate,
        "error_count": runner.error_count
    })
    
    # Log artifacts
    tracker.log_artifact("results", results)

Local File

Simple file-based logging for local analysis.

Config

tracking:
  enabled: true
  backend: local
  output_dir: ./tracking_logs

What Gets Logged

tracking_logs/
├── run_abc123/
│   ├── metrics.json
│   ├── params.json
│   └── artifacts/
│       └── results.json
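
For quick local analysis, the run directory can be written and read back with nothing but the standard library. A minimal sketch, assuming the layout shown above and that the local backend accepts an output_dir keyword mirroring the config key:

import json
from pathlib import Path

from insideLLMs.tracking import get_tracker

# "output_dir" is assumed to mirror the config key above.
tracker = get_tracker("local", output_dir="./tracking_logs")

with tracker.start_run(name="bias-test"):
    tracker.log_metrics({"success_rate": 0.92})

# Read logged metrics back, assuming the run_<id>/metrics.json layout shown above.
for metrics_file in Path("./tracking_logs").glob("run_*/metrics.json"):
    print(metrics_file.parent.name, json.loads(metrics_file.read_text()))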

Weights & Biases

Full-featured experiment tracking with dashboards.

Setup

pip install wandb
wandb login

Config

tracking:
  enabled: true
  backend: wandb
  project: llm-evaluation
  entity: my-team  # Optional
  tags:
    - production
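
The same settings can be passed programmatically. A hedged sketch: only the project keyword appears elsewhere in these docs; entity and tags are assumed to mirror the config keys above:

from insideLLMs.tracking import get_tracker

tracker = get_tracker(
    "wandb",
    project="llm-evaluation",
    entity="my-team",     # assumption: mirrors the optional "entity" config key
    tags=["production"],  # assumption: mirrors the "tags" config key
)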

Features

  • Real-time dashboards
  • Team collaboration
  • Hyperparameter comparison
  • Artifact versioning

Viewing Results

# Link printed after run
# https://wandb.ai/my-team/llm-evaluation/runs/abc123

MLflow

Open-source platform for the machine learning lifecycle.

Setup

pip install mlflow

# Start tracking server (optional)
mlflow server --host 127.0.0.1 --port 5000

Config

tracking:
  enabled: true
  backend: mlflow
  tracking_uri: http://localhost:5000
  experiment_name: llm-evaluation

Features

  • Model registry
  • Experiment comparison
  • Deployment integration
  • Open source

Viewing Results

mlflow ui
# Open http://localhost:5000
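
Logged runs can also be queried outside the UI with the standard MLflow client; a minimal sketch, assuming the tracking URI and experiment name from the config above and at least one logged run:

import mlflow

# Point the client at the same tracking server as the config above.
mlflow.set_tracking_uri("http://localhost:5000")

# Returns a pandas DataFrame; logged metrics appear as "metrics.<name>"
# columns (e.g. "metrics.success_rate" if that metric was logged).
runs = mlflow.search_runs(experiment_names=["llm-evaluation"])
print(runs[["run_id", "start_time", "status"]].head())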

TensorBoard

TensorFlow’s visualisation toolkit.

Setup

pip install tensorboard

Config

tracking:
  enabled: true
  backend: tensorboard
  log_dir: ./tb_logs
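
Programmatically, the backend is selected the same way as the others; a hedged sketch, assuming the log_dir keyword mirrors the config key above:

from insideLLMs.tracking import get_tracker

# "log_dir" is an assumption mirroring the config key above.
tracker = get_tracker("tensorboard", log_dir="./tb_logs")

with tracker.start_run(name="bias-test"):
    tracker.log_metrics({"success_rate": 0.92})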

Viewing Results

tensorboard --logdir ./tb_logs
# Open http://localhost:6006

What Gets Tracked

Category      Examples
Metrics       success_rate, error_count, latency
Parameters    model_name, temperature, probe_type
Artifacts     records.jsonl, summary.json, config
Tags          experiment name, version, environment

Custom Metrics

tracker.log_metrics({
    "custom_score": calculate_score(results),
    "bias_index": calculate_bias(results),
})

Custom Parameters

tracker.log_params({
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 1000,
})

Comparison with Deterministic Artifacts

Aspect          Tracking                Artifacts
Purpose         Analysis, dashboards    CI, reproducibility
Format          Backend-specific        Stable JSON
Determinism     No                      Yes
Team sharing    Yes                     Via git

Use both, as sketched below:

  • Tracking for exploration and visualisation
  • Artifacts for CI diff-gating and reproducibility
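
A minimal sketch of combining the two in a single run, reusing the runner and prompt_set from the programmatic example above; the summary.json filename and layout here are illustrative:

import json

from insideLLMs.tracking import get_tracker

tracker = get_tracker("wandb", project="my-llm-evaluation")

with tracker.start_run(name="bias-test"):
    results = runner.run(prompt_set)

    # Tracking: convenient for dashboards, but backend-specific and not
    # deterministic.
    tracker.log_metrics({"success_rate": runner.success_rate})

    # Deterministic artifact: stable, sorted JSON that can be committed and
    # diff-gated in CI.
    with open("summary.json", "w") as f:
        json.dump({"success_rate": runner.success_rate}, f, indent=2, sort_keys=True)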

Best Practices

Do

  • Tag experiments meaningfully
  • Log hyperparameters consistently
  • Use project/experiment hierarchy
  • Include model and dataset info

Don’t

  • Rely on tracking for determinism
  • Log sensitive data (API keys, PII)
  • Skip local testing before logging

See Also