
Post-evaluation

run_post_evaluation re-evaluates a completed experiment from the saved fold predictions, without re-running any models. It produces a comprehensive_evaluation_report.md for each model, containing overall metrics with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and a CV stability summary.

Example output

# Model performance report

## Model information

- Task: target-Study.Group_taxonomy-fr_order-family_noagg_transform-none_project-none
- Model: XGBoost_n_estimators-1000
- Task type: multiclass
- Test samples per fold: 44
- Total predictions: 440
- Cross-validation: 2 repeats × 10 folds per repeat

## Overall performance

| Metric              | Overall | 95% CI           | Mean (Folds) | STD (Folds) | Mean (Repeats) | STD (Repeats) |
|---------------------|---------|------------------|--------------|-------------|----------------|---------------|
| Accuracy            | 0.7318  | [0.6886, 0.7705] | 0.7318       | 0.0368      | 0.7318         | 0.0064        |
| Balanced Accuracy   | 0.7307  | [0.6881, 0.7732] | 0.7300       | 0.0374      | 0.7307         | 0.0027        |
| F1 Score Weighted   | 0.7301  | [0.6883, 0.7728] | 0.7274       | 0.0384      | 0.7305         | 0.0070        |
| MCC                 | 0.5919  | [0.5270, 0.6557] | 0.5996       | 0.0525      | 0.5920         | 0.0116        |
| nMCC                | 0.7960  | [0.7642, 0.8251] | 0.7998       | 0.0263      | 0.7960         | 0.0058        |
| AUC ROC OvR         | 0.8803  | [0.8512, 0.9063] | 0.8845       | 0.0401      | 0.8814         | 0.0049        |
| AUC ROC OvO         | 0.8811  | [0.8521, 0.9072] | 0.8849       | 0.0421      | 0.8821         | 0.0067        |
| HALO                | 0.7144  | [0.6812, 0.7451] | 0.7165       | 0.0394      | 0.7148         | 0.0089        |

## Overall confusion matrix

|            | Predicted 0 | Predicted 1 | Predicted 2 |
|------------|-------------|-------------|-------------|
| Actual 0   | 143         | 6           | 27          |
| Actual 1   | 5           | 86          | 21          |
| Actual 2   | 33          | 26          | 93          |

## Cross-validation stability

| Metric            | CV    | Stability |
|-------------------|-------|-----------|
| Accuracy          | 0.050 | High      |
| F1 Score Weighted | 0.053 | High      |
| MCC               | 0.088 | High      |
| nMCC              | 0.033 | High      |
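Several figures in the example report can be cross-checked by hand from the confusion matrix and the fold statistics. The sketch below does this with plain Python. Two relationships in it are assumptions inferred from the numbers rather than documented behaviour: that nMCC is MCC rescaled as (MCC + 1) / 2, and that the "CV" column in the stability table is the coefficient of variation, STD (Folds) divided by Mean (Folds).

```python
# Confusion matrix copied from the example report (rows: actual, columns: predicted).
confusion = [
    [143, 6, 27],
    [5, 86, 21],
    [33, 26, 93],
]

# Accuracy is the diagonal (correct predictions) over all predictions.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Balanced accuracy is the unweighted mean of the per-class recalls.
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
balanced_accuracy = sum(recalls) / len(recalls)

# Assumption: nMCC rescales MCC from [-1, 1] to [0, 1].
mcc = 0.5919
nmcc = (mcc + 1) / 2

# Assumption: the stability "CV" is STD (Folds) / Mean (Folds).
accuracy_cv = 0.0368 / 0.7318

print(f"total predictions: {total}")
print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"nMCC:              {nmcc:.4f}")
print(f"accuracy CV:       {accuracy_cv:.3f}")
```

Under these assumptions the computed values agree with the Overall and CV columns of the report to the printed precision.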

Running post-evaluation

from mllabiome.eval_space.posteval import run_post_evaluation

evaluator = run_post_evaluation(
    directory="results/ibd_franzosa/siso/target-Study.Group/taxonomy-fr_order-family_noagg/transform-none/project-none/models/XGBoost_n_estimators-1000",
    bootstrap=True,
)

The directory argument can point at any level of the results tree: a single model directory as above, a models/ container, a task directory, or the full experiment directory. The function discovers all (task, model) pairs automatically and evaluates each.
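To make the idea of discovering (task, model) pairs from any level concrete, here is a minimal standalone sketch. It is not the library's implementation. `discover_model_dirs` and `task_name` are hypothetical helpers, and the layout they rely on (model directories sitting directly inside a `models/` container, with the task name formed from the four path components above it) is an assumption read off the example path and the "Task" line of the example report.

```python
import tempfile
from pathlib import Path


def discover_model_dirs(directory):
    """Hypothetical helper: list model directories at or below `directory`."""
    root = Path(directory)
    # Case 1: `directory` is itself a model directory inside a models/ container.
    if root.parent.name == "models":
        return [root]
    # Case 2: `directory` is a models/ container, or sits somewhere above one.
    containers = [root] if root.name == "models" else sorted(root.rglob("models"))
    return [
        model
        for container in containers
        if container.is_dir()
        for model in sorted(container.iterdir())
        if model.is_dir()
    ]


def task_name(model_dir):
    """Assumed convention: join the four components above models/ with '_'."""
    return "_".join(Path(model_dir).parent.parent.parts[-4:])


# Build a throwaway tree that mirrors the path used in the example call.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp, "target-Study.Group", "taxonomy-fr_order-family_noagg",
                    "transform-none", "project-none")
    # The second model name is made up, purely to have two pairs to discover.
    for name in ("XGBoost_n_estimators-1000", "other_model"):
        (task_dir / "models" / name).mkdir(parents=True)

    # The same pairs are found from the root, the task, and the models/ level.
    for level in (tmp, task_dir, task_dir / "models"):
        found = discover_model_dirs(level)
        print([(task_name(m), m.name) for m in found])
```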

Parameters

| Parameter        | Default  | Description |
|------------------|----------|-------------|
| directory        | required | Path to a model, models container, task, or experiment directory. |
| bootstrap        | True     | Compute bootstrap confidence intervals for every metric. Set to False for faster results; the CI columns are then absent from reports. |
| confidence_level | 0.95     | Confidence level for the intervals. |
| n_bootstrap      | 1000     | Number of bootstrap resamples. |
| verbose          | True     | Print progress to stdout. |
| main_title       | None     | Optional title for generated reports and plots. |
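As an illustration of how n_bootstrap and confidence_level interact, the sketch below computes a percentile bootstrap interval for accuracy using only the standard library. The resampling scheme the package actually uses is not described here, so the percentile method, the `bootstrap_ci` helper, and its `seed` argument are all assumptions made for the example. The labels are rebuilt from the confusion matrix in the example report.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000,
                 confidence_level=0.95, seed=0):
    """Percentile bootstrap interval for `metric` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        # Resample prediction indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    # Take the empirical quantiles that leave alpha in each tail.
    alpha = (1.0 - confidence_level) / 2.0
    lower = scores[int(alpha * (n_bootstrap - 1))]
    upper = scores[int((1.0 - alpha) * (n_bootstrap - 1))]
    return lower, upper


# Rebuild the 440 (actual, predicted) pairs from the example confusion matrix.
confusion = [[143, 6, 27], [5, 86, 21], [33, 26, 93]]
y_true, y_pred = [], []
for actual, row in enumerate(confusion):
    for predicted, count in enumerate(row):
        y_true += [actual] * count
        y_pred += [predicted] * count

lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {accuracy(y_true, y_pred):.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Raising n_bootstrap makes the interval endpoints less noisy at the cost of run time, and a higher confidence_level widens the interval. The endpoints will not match the report exactly, because they depend on the random resamples and on the method used.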

What is written to disk

For each (task, model) pair, the following files are written into a metrics/ subdirectory next to the model:

  • comprehensive_evaluation_report.md — full metric table with confidence intervals, per-repeat and per-fold breakdowns, confusion matrices, and CV stability.
  • roc_curve_overall.parquet, roc_curve_folds.parquet — ROC data (binary classification only).
  • roc_*.png — two ROC plots: one showing fold variability, one showing bootstrap uncertainty.

After all pairs, experiment_summary.md is written at the root of directory with a ranking table across all task-model combinations.

Accessing results programmatically

run_post_evaluation returns a PostEvaluator. The aggregated results for every evaluated pair are accessible as:

evaluator.results # dict keyed by "{task_name}_{model_name}"

Each entry contains overall_metrics, fold_metrics, repeat_metrics, and — when bootstrap=True — a "{metric}_ci" tuple for every metric.
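A sketch of reading one entry is given below. It uses a hand-built dictionary as a stand-in for evaluator.results so that it runs without the package installed. The values come from the example report above, but the metric key names ("accuracy", "mcc"), the placement of the "{metric}_ci" tuples at the top level of the entry, and the `format_metric` helper are assumptions for illustration. Inspect the real keys before relying on them.

```python
# Stand-in for evaluator.results, keyed by "{task_name}_{model_name}".
key = (
    "target-Study.Group_taxonomy-fr_order-family_noagg_transform-none_project-none"
    "_XGBoost_n_estimators-1000"
)
results = {
    key: {
        # Assumed metric key names; values copied from the example report.
        "overall_metrics": {"accuracy": 0.7318, "mcc": 0.5919},
        # Assumed placement of the CI tuples (only present when bootstrap=True).
        "accuracy_ci": (0.6886, 0.7705),
        "mcc_ci": (0.5270, 0.6557),
    }
}


def format_metric(entry, metric):
    """Hypothetical helper: render a metric, with its CI when one is stored."""
    value = entry["overall_metrics"][metric]
    ci = entry.get(f"{metric}_ci")  # missing when bootstrap=False
    if ci is None:
        return f"{metric}: {value:.4f}"
    lower, upper = ci
    return f"{metric}: {value:.4f} [{lower:.4f}, {upper:.4f}]"


for name, entry in results.items():
    print(name)
    for metric in entry["overall_metrics"]:
        print("  " + format_metric(entry, metric))
```

Looking up the CI with `.get` keeps the same code working for runs made with bootstrap=False, where no interval is stored.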
