# Experiment Tracking

Log runs to external tracking systems for visualisation and comparison.
## Supported Backends
| Backend | Use Case |
|---|---|
| Local File | Simple file-based logging |
| Weights & Biases | Team collaboration, dashboards |
| MLflow | Model lifecycle management |
| TensorBoard | TensorFlow ecosystem |
## Enabling Tracking

### Via CLI Flags

```bash
insidellms run experiment.yaml --track wandb --track-project my-llm-evaluation
insidellms harness harness.yaml --track local --track-project my-llm-evaluation
```
### Programmatically

```python
from insideLLMs.experiment_tracking import create_tracker

tracker = create_tracker("wandb", project="my-llm-evaluation")

with tracker:
    # Run your experiment
    results = runner.run(prompt_set)

    # Log metrics
    tracker.log_metrics({
        "success_rate": runner.success_rate,
        "error_count": runner.error_count,
    })

    # Log artifacts
    tracker.log_artifact("./runs/bias-test/summary.json")
```
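The tracker interface used above (`log_metrics`, `log_params`, `log_artifact`, usable as a context manager) can be sketched as a minimal local backend. This is an illustrative stand-in, not insideLLMs' implementation; the `FileTracker` name and on-disk layout here are assumptions:

```python
import json
import shutil
from pathlib import Path

# Hypothetical minimal tracker mirroring the interface shown above.
# Illustrative sketch only, not the library's actual implementation.
class FileTracker:
    def __init__(self, root, project, run_id="run-001"):
        self.run_dir = Path(root) / "tracking" / project / run_id
        self.metrics = {}
        self.params = {}

    def __enter__(self):
        # Create the run directory, including an artifacts/ subfolder
        (self.run_dir / "artifacts").mkdir(parents=True, exist_ok=True)
        return self

    def log_metrics(self, metrics):
        self.metrics.update(metrics)

    def log_params(self, params):
        self.params.update(params)

    def log_artifact(self, path):
        # Copy the artifact file into the run's artifacts/ folder
        shutil.copy(path, self.run_dir / "artifacts")

    def __exit__(self, *exc):
        # Persist accumulated metrics and params on exit
        (self.run_dir / "metrics.json").write_text(json.dumps(self.metrics))
        (self.run_dir / "params.json").write_text(json.dumps(self.params))
        return False
```

The key design point is that writes are deferred to `__exit__`, so a run's metrics land on disk in one place once the `with` block completes.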
## Local File

Simple file-based logging for local analysis.

### CLI

```bash
insidellms run experiment.yaml --track local --track-project my-llm-evaluation
```

Local tracking is written under `<run_dir_parent>/tracking/<track-project>/<run-id>/`.
### What Gets Logged

```text
tracking/
└── my-llm-evaluation/
    └── <run_id>/
        ├── metadata.json
        ├── metrics.json
        ├── params.json
        ├── artifacts.json
        ├── final_state.json
        └── artifacts/
            └── <copied files>
```
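Because the local backend writes plain JSON in the layout above, metrics from past runs can be collected with a short script. The helper below is a sketch for illustration; `load_run_metrics` is not part of the library:

```python
import json
from pathlib import Path

def load_run_metrics(tracking_root: str, project: str) -> dict:
    """Collect metrics.json from every run in the layout shown above.

    Returns {run_id: metrics_dict}. Illustrative helper, not a library API.
    """
    runs = {}
    for metrics_file in sorted(Path(tracking_root, project).glob("*/metrics.json")):
        runs[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return runs
```

Useful for quick cross-run comparison, such as tabulating `success_rate` across a batch of runs.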
## Weights & Biases

Full-featured experiment tracking with dashboards.

### Setup

```bash
pip install wandb
wandb login
```

### CLI

```bash
insidellms run experiment.yaml --track wandb --track-project llm-evaluation
```
### Features

- Real-time dashboards
- Team collaboration
- Hyperparameter comparison
- Artifact versioning

### Viewing Results

```bash
# Link printed after run
# https://wandb.ai/my-team/llm-evaluation/runs/abc123
```
## MLflow

Open-source platform for the ML lifecycle.

### Setup

```bash
pip install mlflow

# Start tracking server (optional)
mlflow server --host 127.0.0.1 --port 5000
```

### CLI

```bash
# Optional: set remote tracking server
export MLFLOW_TRACKING_URI=http://localhost:5000

insidellms run experiment.yaml --track mlflow --track-project llm-evaluation
```
### Features

- Model registry
- Experiment comparison
- Deployment integration
- Open source

### Viewing Results

```bash
mlflow ui
# Open http://localhost:5000
```
## TensorBoard

TensorFlow’s visualisation toolkit.

### Setup

```bash
pip install tensorboard
```

### CLI

```bash
insidellms run experiment.yaml --track tensorboard --track-project llm-evaluation
```

TensorBoard logs are written under `<run_dir_parent>/tracking/tensorboard/<track-project>/<run-id>/`.

### Viewing Results

```bash
tensorboard --logdir ~/.insidellms/runs/tracking/tensorboard/llm-evaluation
# Open http://localhost:6006
```
## What Gets Tracked
| Category | Examples |
|---|---|
| Metrics | success_rate, error_count, latency |
| Parameters | model_name, temperature, probe_type |
| Artifacts | records.jsonl, summary.json, config |
| Tags | experiment name, version, environment |
### Custom Metrics

```python
tracker.log_metrics({
    "custom_score": calculate_score(results),
    "bias_index": calculate_bias(results),
})
```
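`calculate_score` and `calculate_bias` above stand in for your own analysis code. A sketch of such a helper, assuming each result record carries a boolean `success` field (the real record schema may differ):

```python
def calculate_score(results: list[dict]) -> float:
    """Fraction of successful results.

    Assumes each record has a boolean "success" field; adapt this to
    your actual record schema.
    """
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("success")) / len(results)
```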
### Custom Parameters

```python
tracker.log_params({
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 1000,
})
```
## Comparison with Deterministic Artifacts
| Aspect | Tracking | Artifacts |
|---|---|---|
| Purpose | Analysis, dashboards | CI, reproducibility |
| Format | Backend-specific | Stable JSON |
| Determinism | No | Yes |
| Team sharing | Yes | Via git |
Use both:
- Tracking for exploration and visualisation
- Artifacts for CI diff-gating and reproducibility
## Best Practices

### Do

- Tag experiments meaningfully
- Log hyperparameters consistently
- Use project/experiment hierarchy
- Include model and dataset info

### Don’t

- Rely on tracking for determinism
- Log sensitive data (API keys, PII)
- Skip local testing before logging
## See Also
- Understanding Outputs - Deterministic artifacts
- Determinism - Why artifacts are separate