Experiment tracking logs metrics, params, and artifacts to a backend (local or hosted). It complements insideLLMs’ deterministic run artifacts (records.jsonl, manifest.json, summary.json, report.html), which are intended for CI diff-gating and reproducibility.
Tracking lives in insideLLMs.experiment_tracking and provides a unified API across backends.
What Gets Tracked
All trackers support the same core interface:
- `start_run(run_name, run_id, nested)`
- `log_metrics(metrics, step)`
- `log_params(params)`
- `log_artifact(path, name, type)`
- `log_experiment_result(result, prefix)`
- `end_run(status)`
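A minimal sketch of that lifecycle, using the LocalFileTracker described below; the run name, metric values, and artifact path are illustrative, and `end_run()` is assumed to default to a successful status:

```python
from insideLLMs import LocalFileTracker

tracker = LocalFileTracker(output_dir="./experiments")

# Open a run; run_id and nested are optional per the interface above.
tracker.start_run(run_name="baseline")

# Params are logged once; metrics can be logged repeatedly with a step.
tracker.log_params({"model": "gpt-4", "temperature": 0.0})
for step in range(3):
    tracker.log_metrics({"accuracy": 0.90 + 0.01 * step}, step=step)

# Artifacts are copied; the source file is left in place.
tracker.log_artifact("results.json")

tracker.end_run()  # assumed to default to a "finished"-style status
```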
Core data types:
- Run metadata: project, run_name/run_id, tags, notes, start/end timestamps, status
- Metrics: numeric key/value pairs, typically with step and timestamp
- Params/config: key/value pairs (MLflow coerces values to strings)
- Artifacts: files copied or uploaded (backend-specific)
log_experiment_result(...) extracts and logs:
- Metrics: `success_rate`, `total_count`, `success_count`, `error_count`
- Score metrics (if present): `accuracy`, `precision`, `recall`, `f1_score`, `mean_latency_ms`, `total_tokens`, `error_rate`
- Duration: `duration_seconds` (if present)
- Params: `experiment_id`, `model_name`, `model_provider`, `probe_name`, `probe_category`
Artifacts are copied or uploaded; they are not removed from the source path.
TrackingConfig (Shared Settings)
TrackingConfig provides common metadata for all backends:
- `project` (default: `insideLLMs`)
- `experiment_name`
- `tags`
- `notes`
- `log_artifacts`, `log_code`, `auto_log_metrics`
Today, built-in trackers use `project`, `experiment_name`, `tags`, and `notes`. The other flags are reserved for future use.
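A hedged sketch of wiring that shared metadata through a tracker; the import path and the `config=` keyword are assumptions about the module layout rather than a confirmed signature:

```python
from insideLLMs import LocalFileTracker
from insideLLMs.experiment_tracking import TrackingConfig  # assumed import path

# Only project, experiment_name, tags, and notes are used by built-in trackers today.
cfg = TrackingConfig(
    project="insideLLMs",
    experiment_name="safety-baseline",
    tags=["nightly", "gpt-4"],
    notes="Baseline before prompt changes.",
)

# Assumes trackers accept the shared settings as a config keyword argument.
tracker = LocalFileTracker(output_dir="./experiments", config=cfg)
```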
Backends
LocalFileTracker (local)
- Default output dir: `./experiments`
- Run ID: `run_name` if provided, else `config.experiment_name`, else timestamp `YYYYMMDD_HHMMSS`
Layout:
```text
output_dir/
  project_name/
    run_id/
      metadata.json
      metrics.json
      params.json
      artifacts.json
      final_state.json
      artifacts/
        <copied files>
```
Notes:
- Metrics/params are buffered and written at `end_run`.
- JSON serialization uses `default=str` for non-JSON values.
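Because metrics and params are only flushed at `end_run`, reading a finished run back is plain JSON over the layout above; the project and run names here are placeholders:

```python
import json
from pathlib import Path

# Matches the layout above: output_dir/project_name/run_id/*.json
run_dir = Path("./experiments") / "insideLLMs" / "20240101_120000"  # placeholder run_id

metrics = json.loads((run_dir / "metrics.json").read_text())
params = json.loads((run_dir / "params.json").read_text())
print(params, metrics)
```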
WandBTracker (wandb)
- Uses `wandb.init(project=..., entity=..., name=..., tags=..., notes=...)`
- Extra features: `log_table(...)`, `watch_model(...)`
- Run ID is assigned by W&B (optional `run_id` can resume)
MLflowTracker (mlflow)
- `tracking_uri` optional; if not set, MLflow uses `MLFLOW_TRACKING_URI` or local `./mlruns`
- `experiment_name` defaults to `config.experiment_name` or `config.project`
- Extra features: `log_model(...)`, `register_model(...)`
- MLflow params are stored as strings
TensorBoardTracker (tensorboard / tensorboardX)
- `log_dir` default: `./runs`
- Run directory uses `run_name` or ISO timestamp
- Params and artifacts are logged as text
- Extra feature: `log_histogram(...)`
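For example, per-step scalars go through the common `log_metrics(metrics, step)` call; this sketch assumes the same context-manager usage shown for the local backend, and the values are illustrative. Inspect the result with `tensorboard --logdir ./runs`.

```python
from insideLLMs import create_tracker

# Scalars logged with a step show up as TensorBoard time series.
with create_tracker("tensorboard", log_dir="./runs") as tracker:
    for step, acc in enumerate([0.81, 0.86, 0.90]):
        tracker.log_metrics({"accuracy": acc}, step=step)
```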
MultiTracker
Fan-out to multiple backends (e.g., local + W&B).
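A hedged sketch, assuming `MultiTracker` is importable alongside the other trackers and is constructed from a list of backends (the exact constructor is not documented here):

```python
from insideLLMs import LocalFileTracker, MultiTracker, create_tracker

# Assumption: MultiTracker wraps a list of trackers and fans every call out to each.
tracker = MultiTracker([
    LocalFileTracker(output_dir="./experiments"),
    create_tracker("wandb", project="my-project", mode="offline"),
])

tracker.start_run(run_name="baseline")
tracker.log_metrics({"accuracy": 0.95})
tracker.end_run()
```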
Enabling Tracking
CLI (run and harness)
```bash
insidellms run experiment.yaml --track local --track-project my-project
insidellms run experiment.yaml --track wandb --track-project my-project
insidellms run experiment.yaml --track mlflow --track-project my-project
insidellms run experiment.yaml --track tensorboard --track-project my-project
insidellms harness harness.yaml --track local --track-project my-project
```
Notes:
- For `local`, insideLLMs writes tracking logs under `<run_dir_parent>/tracking/<track-project>/<run-id>/`.
- For `tensorboard`, insideLLMs writes TensorBoard logs under `<run_dir_parent>/tracking/tensorboard/<track-project>/<run-id>/`.
- For hosted backends (W&B/MLflow), `--track-project` maps to the backend’s project/experiment name.
- Tracking is best-effort: if the backend dependency is missing, the run continues and tracking is disabled with a warning.
Python API (minimal)
```python
from insideLLMs import create_tracker

with create_tracker("local", output_dir="./experiments") as tracker:
    tracker.log_params({"model": "gpt-4"})
    tracker.log_metrics({"accuracy": 0.95})
    tracker.log_artifact("results.json")
```
Backend-specific construction:
- W&B: `create_tracker("wandb", project="my-project", entity="my-team", mode="offline")`
- MLflow: `create_tracker("mlflow", tracking_uri="http://localhost:5000", experiment_name="my-exp")`
- TensorBoard: `create_tracker("tensorboard", log_dir="./runs")`
Auto-track a function:
```python
from insideLLMs import LocalFileTracker, auto_track

@auto_track(LocalFileTracker(output_dir="./experiments"), experiment_name="baseline")
def run_eval():
    return {"accuracy": 0.93, "f1": 0.91}
```
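Calling `run_eval()` then executes the function as usual; the returned numeric dict is presumably what gets logged as metrics for the `baseline` run.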
Log an ExperimentResult:
```python
from insideLLMs import LocalFileTracker

# result = ... (ExperimentResult)
with LocalFileTracker(output_dir="./experiments") as tracker:
    tracker.log_experiment_result(result, prefix="eval_")
```
Dependencies and Environment
- local: no extra deps
- wandb: `pip install wandb`, authenticate via `wandb login` or `WANDB_API_KEY`
- mlflow: `pip install mlflow`, optionally set `MLFLOW_TRACKING_URI`
- tensorboard: `pip install tensorboard` or `pip install tensorboardX` (TensorBoardTracker uses `torch.utils.tensorboard` if available, else `tensorboardX`)
CI and Determinism Notes
- Canonical run artifacts are deterministic and used for CI diff-gating.
- Tracking backends are not deterministic: they include timestamps, backend-generated run IDs, and external side effects.
- Treat tracked data as sensitive. Do not store secrets or PII.
- To correlate tracking with run artifacts, log the run ID or run directory as params explicitly.
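For example, if the canonical run directory is known at tracking time, it can be recorded as an ordinary param; the path and param name below are placeholders:

```python
from insideLLMs import create_tracker

run_dir = "runs/2024-01-01T00-00-00"  # placeholder: canonical run artifacts directory

with create_tracker("local", output_dir="./experiments") as tracker:
    # Record the canonical run directory so tracked metrics can be traced back
    # to records.jsonl / manifest.json / summary.json for that run.
    tracker.log_params({"insidellms_run_dir": run_dir})
    tracker.log_metrics({"accuracy": 0.95})
```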