# Experiment Tracking

Log runs to external tracking systems for visualisation and comparison.
## Supported Backends

| Backend | Use Case |
|---|---|
| Local File | Simple file-based logging |
| Weights & Biases | Team collaboration, dashboards |
| MLflow | Model lifecycle management |
| TensorBoard | TensorFlow ecosystem |
## Enabling Tracking

### In Config

```yaml
tracking:
  enabled: true
  backend: wandb
  project: my-llm-evaluation
  tags:
    - experiment
    - v1
```
### Programmatically

```python
from insideLLMs.tracking import get_tracker

tracker = get_tracker("wandb", project="my-llm-evaluation")

with tracker.start_run(name="bias-test"):
    # Run your experiment
    results = runner.run(prompt_set)

    # Log metrics
    tracker.log_metrics({
        "success_rate": runner.success_rate,
        "error_count": runner.error_count
    })

    # Log artifacts
    tracker.log_artifact("results", results)
```
## Local File

Simple file-based logging for local analysis.
### Config

```yaml
tracking:
  enabled: true
  backend: local
  output_dir: ./tracking_logs
```
### What Gets Logged

```text
tracking_logs/
├── run_abc123/
│   ├── metrics.json
│   ├── params.json
│   └── artifacts/
│       └── results.json
```
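
Because the local backend writes plain JSON, runs can be inspected with standard tooling. A minimal sketch, assuming the layout shown above (the run ID `run_abc123` is illustrative):

```python
import json
from pathlib import Path

# Illustrative run directory; real run IDs are generated per run
run_dir = Path("tracking_logs") / "run_abc123"

# Load the logged metrics and parameters for quick inspection
metrics = json.loads((run_dir / "metrics.json").read_text())
params = json.loads((run_dir / "params.json").read_text())

print(params.get("model"), metrics.get("success_rate"))
```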
## Weights & Biases

Full-featured experiment tracking with dashboards.
### Setup

```bash
pip install wandb
wandb login
```
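
If the interactive `wandb login` is not practical (for example in CI), W&B also reads an API key from the `WANDB_API_KEY` environment variable:

```bash
# Non-interactive authentication; the key value is a placeholder
export WANDB_API_KEY=<your-api-key>
```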
### Config

```yaml
tracking:
  enabled: true
  backend: wandb
  project: llm-evaluation
  entity: my-team  # Optional
  tags:
    - production
```
### Features

- Real-time dashboards
- Team collaboration
- Hyperparameter comparison
- Artifact versioning
### Viewing Results

```text
# Link printed after run
# https://wandb.ai/my-team/llm-evaluation/runs/abc123
```
## MLflow

Open-source platform for managing the ML lifecycle.
### Setup

```bash
pip install mlflow

# Start tracking server (optional)
mlflow server --host 127.0.0.1 --port 5000
```
### Config

```yaml
tracking:
  enabled: true
  backend: mlflow
  tracking_uri: http://localhost:5000
  experiment_name: llm-evaluation
```
### Features

- Model registry
- Experiment comparison
- Deployment integration
- Open source
### Viewing Results

```bash
mlflow ui
# Open http://localhost:5000
```
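
The MLflow backend can also be selected programmatically with `get_tracker`, as in the earlier example. A minimal sketch, assuming the keyword arguments mirror the config keys above (an assumption, not a confirmed signature):

```python
from insideLLMs.tracking import get_tracker

# Assumed: keyword arguments mirror the tracking_uri / experiment_name config keys
tracker = get_tracker(
    "mlflow",
    tracking_uri="http://localhost:5000",
    experiment_name="llm-evaluation",
)

with tracker.start_run(name="bias-test"):
    tracker.log_metrics({"success_rate": 0.97})  # illustrative value
```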
## TensorBoard

TensorFlow’s visualisation toolkit.
### Setup

```bash
pip install tensorboard
```

### Config

```yaml
tracking:
  enabled: true
  backend: tensorboard
  log_dir: ./tb_logs
```
### Viewing Results

```bash
tensorboard --logdir ./tb_logs
# Open http://localhost:6006
```
## What Gets Tracked

| Category | Examples |
|---|---|
| Metrics | success_rate, error_count, latency |
| Parameters | model_name, temperature, probe_type |
| Artifacts | records.jsonl, summary.json, config |
| Tags | experiment name, version, environment |
## Custom Metrics

```python
tracker.log_metrics({
    "custom_score": calculate_score(results),
    "bias_index": calculate_bias(results),
})
```
## Custom Parameters

```python
tracker.log_params({
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 1000,
})
```
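
Custom metrics and parameters are typically logged inside `tracker.start_run(...)`, as in the earlier example, so they attach to a specific run. A minimal combined sketch (the values are illustrative):

```python
from insideLLMs.tracking import get_tracker

tracker = get_tracker("wandb", project="my-llm-evaluation")

with tracker.start_run(name="bias-test-v2"):
    # Parameters describe how the run was configured
    tracker.log_params({
        "model": "gpt-4o",
        "temperature": 0.7,
    })

    # Metrics describe how the run turned out
    tracker.log_metrics({
        "custom_score": 0.92,  # illustrative value
        "bias_index": 0.11,    # illustrative value
    })
```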
## Comparison with Deterministic Artifacts

| Aspect | Tracking | Artifacts |
|---|---|---|
| Purpose | Analysis, dashboards | CI, reproducibility |
| Format | Backend-specific | Stable JSON |
| Determinism | No | Yes |
| Team sharing | Yes | Via git |
Use both:
- Tracking for exploration and visualisation
- Artifacts for CI diff-gating and reproducibility
## Best Practices

### Do

- Tag experiments meaningfully
- Log hyperparameters consistently
- Use project/experiment hierarchy
- Include model and dataset info
### Don’t

- Rely on tracking for determinism
- Log sensitive data such as API keys or PII (see the filtering sketch below)
- Skip local testing before logging
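
One way to honour the sensitive-data point is to sanitise parameters locally before they are logged. A minimal sketch (the key patterns and the parameter dict are illustrative):

```python
SENSITIVE_MARKERS = ("api_key", "token", "secret")

def safe_params(params: dict) -> dict:
    """Drop any parameter whose name looks like a credential."""
    return {
        key: value
        for key, value in params.items()
        if not any(marker in key.lower() for marker in SENSITIVE_MARKERS)
    }

# Illustrative usage with the tracker API shown above
tracker.log_params(safe_params({
    "model": "gpt-4o",
    "temperature": 0.7,
    "openai_api_key": "sk-...",  # filtered out before logging
}))
```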
## See Also

- Understanding Outputs - Deterministic artifacts
- Determinism - Why artifacts are separate