Post-evaluation

`run_post_evaluation` re-evaluates a completed experiment from its saved fold predictions, without re-running any models. For each model it writes a `comprehensive_evaluation_report.md` containing overall metrics with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and a cross-validation stability summary.

Example output


# Model performance report
 
## Model information
- Task: target-Study.Group_taxonomy-fr_order-family_noagg_transform-none_project-none
- Model: XGBoost_n_estimators-1000
- Task type: multiclass
- Test samples per fold: 44
- Total predictions: 440
- Cross-validation: 2 repeats × 10 folds per repeat
 
## Overall performance
 
| Metric              | Overall | 95% CI           | Mean (Folds) | STD (Folds) | Mean (Repeats) | STD (Repeats) |
|---------------------|---------|------------------|--------------|-------------|----------------|---------------|
| Accuracy            | 0.7318  | [0.6886, 0.7705] | 0.7318       | 0.0368      | 0.7318         | 0.0064        |
| Balanced Accuracy   | 0.7307  | [0.6881, 0.7732] | 0.7300       | 0.0374      | 0.7307         | 0.0027        |
| F1 Score Weighted   | 0.7301  | [0.6883, 0.7728] | 0.7274       | 0.0384      | 0.7305         | 0.0070        |
| MCC                 | 0.5919  | [0.5270, 0.6557] | 0.5996       | 0.0525      | 0.5920         | 0.0116        |
| nMCC                | 0.7960  | [0.7642, 0.8251] | 0.7998       | 0.0263      | 0.7960         | 0.0058        |
| AUC ROC OvR         | 0.8803  | [0.8512, 0.9063] | 0.8845       | 0.0401      | 0.8814         | 0.0049        |
| AUC ROC OvO         | 0.8811  | [0.8521, 0.9072] | 0.8849       | 0.0421      | 0.8821         | 0.0067        |
| HALO                | 0.7144  | [0.6812, 0.7451] | 0.7165       | 0.0394      | 0.7148         | 0.0089        |
 
## Overall confusion matrix
 
|            | Predicted 0 | Predicted 1 | Predicted 2 |
|------------|-------------|-------------|-------------|
| Actual 0   | 143         | 6           | 27          |
| Actual 1   | 5           | 86          | 21          |
| Actual 2   | 33          | 26          | 93          |
 
## Cross-validation stability
 
| Metric            | CV    | Stability |
|-------------------|-------|-----------|
| Accuracy          | 0.050 | High      |
| F1 Score Weighted | 0.053 | High      |
| MCC               | 0.088 | High      |
| nMCC              | 0.033 | High      |
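
The figures in the example report can be cross-checked by hand. The sketch below recomputes accuracy, balanced accuracy, MCC, and nMCC from the overall confusion matrix, and the stability CV from the fold mean and standard deviation of accuracy. It assumes that nMCC is (MCC + 1) / 2 and that the CV column is a coefficient of variation (fold STD divided by fold mean). Both assumptions reproduce the values shown above, but they are inferred from the numbers and not taken from the library's source.

```python
import math

# Overall confusion matrix from the example report
# (rows: actual class, columns: predicted class).
cm = [
    [143, 6, 27],
    [5, 86, 21],
    [33, 26, 93],
]

n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
actual = [sum(row) for row in cm]                                # row sums
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums

accuracy = correct / total
# Balanced accuracy: unweighted mean of per-class recall.
balanced_accuracy = sum(cm[i][i] / actual[i] for i in range(n)) / n

# Multiclass MCC written in terms of the confusion-matrix sums.
numerator = correct * total - sum(p * t for p, t in zip(predicted, actual))
denominator = math.sqrt(
    (total**2 - sum(p * p for p in predicted))
    * (total**2 - sum(t * t for t in actual))
)
mcc = numerator / denominator
nmcc = (mcc + 1) / 2  # ASSUMPTION: nMCC rescales MCC from [-1, 1] to [0, 1]

# ASSUMPTION: stability CV = fold STD / fold mean (accuracy row of the report).
cv_accuracy = 0.0368 / 0.7318

print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced_accuracy:.4f}")
print(f"mcc={mcc:.4f} nmcc={nmcc:.4f} cv_accuracy={cv_accuracy:.3f}")
```

The printed values (0.7318, 0.7307, 0.5919, 0.7960, 0.050) agree with the Overall and CV columns of the report.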

Running post-evaluation


from mllabiome.eval_space.posteval import run_post_evaluation
 
evaluator = run_post_evaluation(
    directory="results/ibd_franzosa/siso/target-Study.Group/taxonomy-fr_order-family_noagg/transform-none/project-none/models/XGBoost_n_estimators-1000",
    bootstrap=True,
)

The `directory` argument can point at any level of the results tree: a single model directory as above, a `models/` container, a task directory, or the full experiment directory. The function automatically discovers all (task, model) pairs beneath that level and evaluates each one.
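
To make the discovery step concrete, here is a minimal sketch of how model directories could be located from any starting level. It is not the library's discovery code: `find_model_dirs` is a hypothetical helper, the model name `RandomForest_example` is made up for the demo, and the only layout assumption (taken from the example path above) is that each model sits directly under a `models/` container.

```python
from pathlib import Path
import tempfile


def find_model_dirs(directory):
    """Return model directories at or below `directory` (hypothetical helper)."""
    root = Path(directory)
    # Case 1: `directory` is itself a model directory (its parent is models/).
    if root.parent.name == "models":
        return [root]
    # Case 2: `directory` is a models container, task, or experiment directory.
    return sorted(
        p for p in root.rglob("*")
        if p.is_dir() and p.parent.name == "models"
    )


# Build a throwaway tree that mimics the assumed layout, then query three levels.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "target-Study.Group" / "taxonomy-fr_order-family_noagg"
    for model in ("XGBoost_n_estimators-1000", "RandomForest_example"):
        (task / "models" / model).mkdir(parents=True)

    from_experiment = [p.name for p in find_model_dirs(tmp)]
    from_container = [p.name for p in find_model_dirs(task / "models")]
    from_model = [
        p.name
        for p in find_model_dirs(task / "models" / "XGBoost_n_estimators-1000")
    ]

print(from_experiment)
print(from_container)
print(from_model)
```

Pointing at the experiment root or at the `models/` container yields both models, while pointing at a single model directory yields only that model.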

Parameters

| Parameter          | Default  | Description |
|--------------------|----------|-------------|
| `directory`        | required | Path to a model, models container, task, or experiment directory. |
| `bootstrap`        | `True`   | Compute bootstrap confidence intervals for every metric. Set to `False` for faster results; CI columns will be absent from reports. |
| `confidence_level` | `0.95`   | Confidence level for the intervals. |
| `n_bootstrap`      | `1000`   | Number of bootstrap resamples. |
| `verbose`          | `True`   | Print progress to stdout. |
| `main_title`       | `None`   | Optional title for generated reports and plots. |
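
For intuition about what `n_bootstrap` and `confidence_level` control, the following is a generic percentile bootstrap for accuracy. It is a sketch under stated assumptions, not the library's implementation: `bootstrap_accuracy_ci` is a hypothetical function, and the library may resample differently (for example per fold or per repeat). The labels are rebuilt from the example confusion matrix so the result can be compared loosely with the reported interval.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_bootstrap=1000,
                          confidence_level=0.95, seed=0):
    """Percentile bootstrap CI for accuracy (illustrative, hypothetical helper)."""
    rng = random.Random(seed)
    n = len(y_true)
    hits = [int(t == p) for t, p in zip(y_true, y_pred)]
    scores = []
    for _ in range(n_bootstrap):
        # Resample predictions with replacement and rescore.
        sample_hits = sum(hits[rng.randrange(n)] for _ in range(n))
        scores.append(sample_hits / n)
    scores.sort()
    alpha = 1.0 - confidence_level
    lower = scores[int((alpha / 2) * n_bootstrap)]
    upper = scores[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return lower, upper


# Rebuild (actual, predicted) label pairs from the example confusion matrix.
cm = [[143, 6, 27], [5, 86, 21], [33, 26, 93]]
y_true, y_pred = [], []
for actual_label, row in enumerate(cm):
    for predicted_label, count in enumerate(row):
        y_true += [actual_label] * count
        y_pred += [predicted_label] * count

point = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy={point:.4f} CI=[{lower:.4f}, {upper:.4f}]")
```

Raising `n_bootstrap` smooths the interval endpoints at the cost of runtime, and raising `confidence_level` widens the interval.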

What is written to disk

For each (task, model) pair, the following files are written into a metrics/ subdirectory next to the model:

- `comprehensive_evaluation_report.md`: full metric table with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and CV stability.
- `roc_curve_overall.parquet`, `roc_curve_folds.parquet`: ROC data (binary classification only).
- `roc_*.png`: two ROC plots, one showing fold variability and one showing bootstrap uncertainty.

After all pairs have been evaluated, `experiment_summary.md` is written at the root of `directory`, with a table ranking all task-model combinations.
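
When many pairs are evaluated at once, it can be handy to gather every generated report in one pass. The sketch below searches by file name only, so it makes no assumption about where exactly the `metrics/` subdirectory sits. `collect_reports` is a hypothetical helper, and the directory layout created in the demo is a placeholder, not the library's guaranteed layout.

```python
from pathlib import Path
import tempfile

REPORT_NAME = "comprehensive_evaluation_report.md"


def collect_reports(directory):
    """Map each report path under `directory` to its text (hypothetical helper)."""
    root = Path(directory)
    return {
        path: path.read_text(encoding="utf-8")
        for path in sorted(root.rglob(REPORT_NAME))
    }


# Demo on a throwaway tree; the placement of metrics/ here is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    metrics_dir = Path(tmp) / "models" / "XGBoost_n_estimators-1000" / "metrics"
    metrics_dir.mkdir(parents=True)
    (metrics_dir / REPORT_NAME).write_text(
        "# Model performance report\n", encoding="utf-8"
    )

    reports = collect_reports(tmp)

for path, text in reports.items():
    # Print the containing directory name and the report's first line.
    print(path.parent.name, "->", text.splitlines()[0])
```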

Accessing results programmatically

run_post_evaluation returns a PostEvaluator. The aggregated results for every evaluated pair are accessible as:


evaluator.results  # dict keyed by "{task_name}_{model_name}"

Each entry contains `overall_metrics`, `fold_metrics`, and `repeat_metrics`. When `bootstrap=True`, it also contains a `"{metric}_ci"` tuple for every metric.
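
As an illustration of working with this structure, the sketch below pulls one metric and its interval out of every entry. It runs against a hand-built stand-in for `evaluator.results`, not a real `PostEvaluator`. Several details are assumptions: that `overall_metrics` is a mapping from metric name to value, that the metric key is spelled `"accuracy"`, and that the `"{metric}_ci"` tuple sits at the top level of each entry. `summarize` is a hypothetical helper.

```python
def summarize(results, metric):
    """Return {pair_key: (overall value, CI or None)} for one metric."""
    summary = {}
    for key, entry in results.items():
        value = entry["overall_metrics"][metric]
        # The CI key is absent when bootstrap=False, so fall back to None.
        ci = entry.get(f"{metric}_ci")
        summary[key] = (value, ci)
    return summary


# Stand-in for evaluator.results, shaped as described above.
# ASSUMPTIONS: the "accuracy" key name and the dict nesting are illustrative.
pair_key = (
    "target-Study.Group_taxonomy-fr_order-family_noagg_transform-none_project-none"
    "_XGBoost_n_estimators-1000"
)
results = {
    pair_key: {
        "overall_metrics": {"accuracy": 0.7318},
        "fold_metrics": {},    # omitted in this stand-in
        "repeat_metrics": {},  # omitted in this stand-in
        "accuracy_ci": (0.6886, 0.7705),
    }
}

summary = summarize(results, "accuracy")
for key, (value, ci) in summary.items():
    print(key, value, ci)
```

Using `entry.get(...)` keeps the same loop usable for runs made with `bootstrap=False`, where no interval is stored.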