Post-evaluation
run_post_evaluation re-evaluates a completed experiment from the saved fold predictions, without re-running any models. It produces a comprehensive_evaluation_report.md for each model, containing overall metrics with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and a CV stability summary.
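The CV column of the stability summary is not defined in the text. The figures in the example report below are consistent with the coefficient of variation, that is, the standard deviation of the fold-level scores divided by their mean. The following is a minimal sketch under that assumption; `coefficient_of_variation` and `example_rows` are illustrative names, not part of the package, and the choice of sample standard deviation is a guess.

```python
from statistics import mean, stdev


def coefficient_of_variation(fold_scores):
    """Return std / mean of per-fold scores.

    Uses the sample standard deviation; whether the package uses the
    sample or population form is an assumption.
    """
    mean_value = mean(fold_scores)
    if mean_value == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(fold_scores) / mean_value


# (Mean (Folds), STD (Folds)) pairs copied from the example report.
example_rows = {
    "Accuracy": (0.7318, 0.0368),
    "F1 Score Weighted": (0.7274, 0.0384),
    "MCC": (0.5996, 0.0525),
    "nMCC": (0.7998, 0.0263),
}

# Recompute the CV column from the summary statistics.
cv_table = {
    name: round(std_value / mean_value, 3)
    for name, (mean_value, std_value) in example_rows.items()
}
print(cv_table)
```

The recomputed values round to 0.050, 0.053, 0.088, and 0.033, matching the CV column of the example report. The thresholds behind the Stability label are not stated in this document and are not reproduced here.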
Example output
# Model performance report
## Model information
- Task: target-Study.Group_taxonomy-fr_order-family_noagg_transform-none_project-none
- Model: XGBoost_n_estimators-1000
- Task type: multiclass
- Test samples per fold: 44
- Total predictions: 440
- Cross-validation: 2 repeats × 10 folds per repeat
## Overall performance
| Metric | Overall | 95% CI | Mean (Folds) | STD (Folds) | Mean (Repeats) | STD (Repeats) |
|---------------------|---------|------------------|--------------|-------------|----------------|---------------|
| Accuracy | 0.7318 | [0.6886, 0.7705] | 0.7318 | 0.0368 | 0.7318 | 0.0064 |
| Balanced Accuracy | 0.7307 | [0.6881, 0.7732] | 0.7300 | 0.0374 | 0.7307 | 0.0027 |
| F1 Score Weighted | 0.7301 | [0.6883, 0.7728] | 0.7274 | 0.0384 | 0.7305 | 0.0070 |
| MCC | 0.5919 | [0.5270, 0.6557] | 0.5996 | 0.0525 | 0.5920 | 0.0116 |
| nMCC | 0.7960 | [0.7642, 0.8251] | 0.7998 | 0.0263 | 0.7960 | 0.0058 |
| AUC ROC OvR | 0.8803 | [0.8512, 0.9063] | 0.8845 | 0.0401 | 0.8814 | 0.0049 |
| AUC ROC OvO | 0.8811 | [0.8521, 0.9072] | 0.8849 | 0.0421 | 0.8821 | 0.0067 |
| HALO | 0.7144 | [0.6812, 0.7451] | 0.7165 | 0.0394 | 0.7148 | 0.0089 |
## Overall confusion matrix
| | Predicted 0 | Predicted 1 | Predicted 2 |
|------------|-------------|-------------|-------------|
| Actual 0 | 143 | 6 | 27 |
| Actual 1 | 5 | 86 | 21 |
| Actual 2 | 33 | 26 | 93 |
## Cross-validation stability
| Metric | CV | Stability |
|-------------------|-------|-----------|
| Accuracy | 0.050 | High |
| F1 Score Weighted | 0.053 | High |
| MCC | 0.088 | High |
| nMCC | 0.033 | High |
Running post-evaluation
from mllabiome.eval_space.posteval import run_post_evaluation
evaluator = run_post_evaluation(
    directory="results/ibd_franzosa/siso/target-Study.Group/taxonomy-fr_order-family_noagg/transform-none/project-none/models/XGBoost_n_estimators-1000",
    bootstrap=True,
)
The directory argument can point at any level of the results tree: a single model directory as above, a models/ container, a task directory, or the full experiment directory. The function discovers all (task, model) pairs automatically and evaluates each.
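To make the discovery step concrete, here is a rough sketch of how (task, model) pairs could be collected from any level of the tree. It is not the package's implementation. `discover_pairs` is a hypothetical helper, and the layout it assumes (four task components above a models/ directory, joined with underscores to form the task name) is inferred only from the example path and the Task line of the example report.

```python
import tempfile
from pathlib import Path


def discover_pairs(directory):
    """Return sorted (task_name, model_name) pairs found at or below `directory`.

    Assumed layout (inferred from the example path, not from the library):
    <target>/<taxonomy>/<transform>/<project>/models/<model_name>
    """
    root = Path(directory).resolve()
    # Include the root itself so that a single model directory also works.
    candidates = [root, *root.rglob("*")]
    pairs = []
    for path in candidates:
        # A model directory is taken to be any directory directly inside "models/".
        if path.is_dir() and path.parent.name == "models":
            task_name = "_".join(path.parts[-6:-2])  # the four task components
            pairs.append((task_name, path.name))
    return sorted(pairs)


# Build a throwaway tree with one task and two models to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp, "exp", "target-y", "taxonomy-a", "transform-none", "project-none")
    for model in ("ModelA", "ModelB"):
        (task_dir / "models" / model / "metrics").mkdir(parents=True)

    from_experiment = discover_pairs(Path(tmp, "exp"))
    from_container = discover_pairs(task_dir / "models")
    from_model = discover_pairs(task_dir / "models" / "ModelA")

print(from_experiment)
print(from_model)
```

In this sketch, the experiment directory and the models/ container both yield the two pairs, while the single model directory yields one.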
Parameters
| Parameter | Default | Description |
|---|---|---|
| directory | required | Path to a model, models container, task, or experiment directory. |
| bootstrap | True | Compute bootstrap confidence intervals for every metric. Set to False for faster results; the CI columns will then be absent from reports. |
| confidence_level | 0.95 | Confidence level for the intervals. |
| n_bootstrap | 1000 | Number of bootstrap resamples. |
| verbose | True | Print progress to stdout. |
| main_title | None | Optional title for generated reports and plots. |
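For readers unfamiliar with bootstrap intervals, the sketch below shows one common variant, the percentile bootstrap, applied to accuracy. It only illustrates what n_bootstrap and confidence_level typically control. The resampling scheme the package actually uses is not described here, and `bootstrap_ci`, `accuracy`, and the `seed` argument are hypothetical names.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, confidence_level=0.95, seed=0):
    """Percentile bootstrap interval for `metric` (an assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        # Resample prediction indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    # Take the empirical alpha and (1 - alpha) quantiles of the resampled scores.
    lower = scores[int(alpha * n_bootstrap)]
    upper = scores[min(n_bootstrap - 1, int((1.0 - alpha) * n_bootstrap))]
    return lower, upper


# Toy labels: 8 of the 10 predictions are correct.
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
y_pred = [0, 1, 2, 0, 1, 1, 0, 1, 2, 2]

ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
print(accuracy(y_true, y_pred), (ci_low, ci_high))
```

A larger n_bootstrap gives smoother quantile estimates at the cost of run time, which is why setting bootstrap=False is the faster option.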
What is written to disk
For each (task, model) pair, the following files are written into a metrics/ subdirectory next to the model:
- comprehensive_evaluation_report.md: full metric table with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and CV stability.
- roc_curve_overall.parquet, roc_curve_folds.parquet: ROC data (binary classification only).
- roc_*.png: two ROC plots, one showing fold variability and one showing bootstrap uncertainty.
After all pairs have been evaluated, experiment_summary.md is written at the root of directory with a table ranking all task-model combinations.
Accessing results programmatically
run_post_evaluation returns a PostEvaluator. The aggregated results for every evaluated pair are accessible as:
evaluator.results  # dict keyed by "{task_name}_{model_name}"
Each entry contains overall_metrics, fold_metrics, repeat_metrics, and, when bootstrap=True, a "{metric}_ci" tuple for every metric.
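As an illustration of working with this structure, the sketch below ranks entries by a single metric. It runs on a hand-written stand-in dict rather than a real PostEvaluator. The inner shapes are assumptions: overall_metrics is taken to be a mapping from metric name to value, the metric key is assumed to be "accuracy", and the CI tuple is assumed to be (lower, upper). `rank_by_metric` and `stand_in_results` are hypothetical names.

```python
def rank_by_metric(results, metric="accuracy"):
    """Return (key, value, ci) rows sorted by `metric`, best first.

    `results` is expected to look like evaluator.results; the inner layout
    used here is an assumption, not taken from the package.
    """
    rows = []
    for key, entry in results.items():
        value = entry["overall_metrics"][metric]
        ci = entry.get(f"{metric}_ci")  # None when bootstrap=False
        rows.append((key, value, ci))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Stand-in data with made-up keys and values, shaped like the description above.
stand_in_results = {
    "taskA_ModelX": {
        "overall_metrics": {"accuracy": 0.70},
        "accuracy_ci": (0.65, 0.75),
    },
    "taskA_ModelY": {
        "overall_metrics": {"accuracy": 0.80},
        # No CI entry, as would be the case with bootstrap=False.
    },
}

ranking = rank_by_metric(stand_in_results)
for key, value, ci in ranking:
    print(key, value, ci)
```

With a real run, `evaluator.results` would take the place of `stand_in_results`, provided the assumed inner layout holds.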