Experiment tracking logs metrics, params, and artifacts to a backend (local or hosted). It complements insideLLMs’ deterministic run artifacts (records.jsonl, manifest.json, summary.json, report.html), which are intended for CI diff-gating and reproducibility.

Tracking lives in insideLLMs.experiment_tracking and provides a unified API across backends.

What Gets Tracked

All trackers support the same core interface:

  • start_run(run_name, run_id, nested)
  • log_metrics(metrics, step)
  • log_params(params)
  • log_artifact(path, name, type)
  • log_experiment_result(result, prefix)
  • end_run(status)
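
A minimal lifecycle using this interface (a sketch; the backend name and values are illustrative, and end_run defaulting to a success status is an assumption):

from insideLLMs import create_tracker

tracker = create_tracker("local", output_dir="./experiments")
tracker.start_run(run_name="baseline")
tracker.log_params({"model": "gpt-4", "temperature": 0})
tracker.log_metrics({"accuracy": 0.95}, step=0)
tracker.log_artifact("results.json")
tracker.end_run()  # status is assumed to default to a success value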

Core data types:

  • Run metadata: project, run_name/run_id, tags, notes, start/end timestamps, status
  • Metrics: numeric key/value pairs, typically with step and timestamp
  • Params/config: key/value pairs (MLflow coerces values to strings)
  • Artifacts: files copied or uploaded (backend-specific)

log_experiment_result(...) extracts and logs:

  • Metrics: success_rate, total_count, success_count, error_count
  • Score metrics (if present): accuracy, precision, recall, f1_score, mean_latency_ms, total_tokens, error_rate
  • Duration: duration_seconds (if present)
  • Params: experiment_id, model_name, model_provider, probe_name, probe_category

Artifacts are copied or uploaded; they are not removed from the source path.
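
As an illustration, a call with prefix="eval_" records roughly the following (values are made up; prepending the prefix to metric keys, and leaving params unprefixed, are assumptions based on common tracker conventions):

logged_metrics = {
    "eval_success_rate": 0.92,
    "eval_total_count": 50,
    "eval_success_count": 46,
    "eval_error_count": 4,
}
logged_params = {
    "experiment_id": "exp-001",   # hypothetical values throughout
    "model_name": "gpt-4",
    "model_provider": "openai",
    "probe_name": "logic_probe",
    "probe_category": "reasoning",
}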

TrackingConfig (Shared Settings)

TrackingConfig provides common metadata for all backends:

  • project (default: insideLLMs)
  • experiment_name
  • tags
  • notes
  • log_artifacts, log_code, auto_log_metrics

Today, built-in trackers use project, experiment_name, tags, and notes. The other flags are reserved for future use.
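
A construction sketch, assuming TrackingConfig takes these fields as keyword arguments (verify against the class in insideLLMs.experiment_tracking):

from insideLLMs.experiment_tracking import TrackingConfig

config = TrackingConfig(
    project="my-project",
    experiment_name="baseline-eval",
    tags=["nightly", "gpt-4"],   # tags as a list is an assumption
    notes="Baseline before prompt changes.",
)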

Backends

LocalFileTracker (local)

  • Default output dir: ./experiments
  • Run ID: run_name if provided, else config.experiment_name, else timestamp YYYYMMDD_HHMMSS

Layout:

output_dir/
  project_name/
    run_id/
      metadata.json
      metrics.json
      params.json
      artifacts.json
      final_state.json
      artifacts/
        <copied files>

Notes:

  • Metrics/params are buffered and written at end_run.
  • JSON serialization uses default=str to stringify non-JSON-serializable values.
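
Because everything is plain JSON, post-run inspection is straightforward (a sketch; the run_id is a hypothetical timestamp following the naming rule above):

import json
from pathlib import Path

# output_dir/project/run_id/metrics.json, per the layout above
run_dir = Path("./experiments") / "insideLLMs" / "20240101_120000"
metrics = json.loads((run_dir / "metrics.json").read_text())
print(metrics)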

WandBTracker (wandb)

  • Uses wandb.init(project=..., entity=..., name=..., tags=..., notes=...)
  • Extra features: log_table(...), watch_model(...)
  • Run ID is assigned by W&B; pass an optional run_id to resume an existing run

MLflowTracker (mlflow)

  • tracking_uri optional; if not set, MLflow uses MLFLOW_TRACKING_URI or local ./mlruns
  • experiment_name defaults to config.experiment_name or config.project
  • Extra features: log_model(...), register_model(...)
  • MLflow params are stored as strings

TensorBoardTracker (tensorboard / tensorboardX)

  • log_dir default: ./runs
  • Run directory is named after run_name, falling back to an ISO timestamp
  • Params and artifacts are logged as text
  • Extra feature: log_histogram(...)

MultiTracker

Fan-out to multiple backends (e.g., local + W&B).
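
A construction sketch, assuming MultiTracker accepts a list of trackers and forwards every call to each (the import path and constructor signature are assumptions; check insideLLMs.experiment_tracking):

from insideLLMs import LocalFileTracker, create_tracker
from insideLLMs.experiment_tracking import MultiTracker  # path assumed

tracker = MultiTracker([
    LocalFileTracker(output_dir="./experiments"),
    create_tracker("wandb", project="my-project"),
])
tracker.log_metrics({"accuracy": 0.95})  # fans out to both backends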

Enabling Tracking

CLI (run and harness)

insidellms run experiment.yaml --track local --track-project my-project
insidellms run experiment.yaml --track wandb --track-project my-project
insidellms run experiment.yaml --track mlflow --track-project my-project
insidellms run experiment.yaml --track tensorboard --track-project my-project

insidellms harness harness.yaml --track local --track-project my-project

Notes:

  • For local, insideLLMs writes tracking logs under <run_dir_parent>/tracking/<track-project>/<run-id>/.
  • For tensorboard, insideLLMs writes TensorBoard logs under <run_dir_parent>/tracking/tensorboard/<track-project>/<run-id>/.
  • For hosted backends (W&B/MLflow), --track-project maps to the backend’s project/experiment name.
  • Tracking is best-effort. If the backend dependency is missing, the run continues and tracking is disabled with a warning.

Python API (minimal)

from insideLLMs import create_tracker

with create_tracker("local", output_dir="./experiments") as tracker:
    tracker.log_params({"model": "gpt-4"})
    tracker.log_metrics({"accuracy": 0.95})
    tracker.log_artifact("results.json")

Backend-specific construction:

  • W&B: create_tracker("wandb", project="my-project", entity="my-team", mode="offline")
  • MLflow: create_tracker("mlflow", tracking_uri="http://localhost:5000", experiment_name="my-exp")
  • TensorBoard: create_tracker("tensorboard", log_dir="./runs")

Auto-track a function:

from insideLLMs import LocalFileTracker, auto_track

@auto_track(LocalFileTracker(output_dir="./experiments"), experiment_name="baseline")
def run_eval():
    return {"accuracy": 0.93, "f1": 0.91}

Log an ExperimentResult:

from insideLLMs import LocalFileTracker

# result = ... (ExperimentResult)
with LocalFileTracker(output_dir="./experiments") as tracker:
    tracker.log_experiment_result(result, prefix="eval_")

Dependencies and Environment

  • local: no extra deps
  • wandb: pip install wandb, authenticate via wandb login or WANDB_API_KEY
  • mlflow: pip install mlflow, optionally set MLFLOW_TRACKING_URI
  • tensorboard: pip install tensorboard or pip install tensorboardX (TensorBoardTracker uses torch.utils.tensorboard if available, else tensorboardX)

CI and Determinism Notes

  • Canonical run artifacts are deterministic and used for CI diff-gating.
  • Tracking backends are not deterministic: they include timestamps, backend-generated run IDs, and external side effects.
  • Treat tracked data as sensitive. Do not store secrets or PII.
  • To correlate tracking with run artifacts, log the run ID or run directory as params explicitly, as in the sketch below.
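
For example (a sketch; the run directory value comes from your own insidellms invocation, and the param key is arbitrary):

from insideLLMs import LocalFileTracker

run_dir = "runs/20240101_120000"  # hypothetical canonical run directory
with LocalFileTracker(output_dir="./experiments") as tracker:
    tracker.log_params({"insidellms_run_dir": run_dir})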