MPMAs: ensembling

An ensemble combines the predictions of multiple trained MPMAs. `EnsembleSweep` searches over all combinations of member selection (construction) strategy, pool size, and diversity weight, then returns the single best configuration. See the Ensemble sweep tutorial for a worked example.

SweepConfig parameters

Parameter	Description
`experiment_dir`	Root directory of the experiment, containing all individual model results.
`optimize_metric`	Metric used to rank ensembles. `"halo"` is the composite HALO score. Alternatives include `"nmcc"`, `"auc"`, and `"f1"`.
`mode`	Evaluation split to optimise on. `"inner_validation"` uses the inner CV folds produced during training, so no test data is touched.
`min_models`	Minimum number of member models per ensemble (default `2`).
`max_models`	Maximum number of member models per ensemble.
`pool_sizes`	List of candidate pool sizes to search over. Each value controls how many models are pre-selected before a selection strategy is applied. Defaults to `[50, 100, 200]`.
`diversity_weights`	List of diversity penalty weights to try. Higher values favour ensembles whose members disagree more with each other. Defaults to `[0.1, 0.3, 0.5, 0.7]`.
`prob_models_only`	When `True`, only models that output calibrated probabilities are eligible. Required by all probability-based aggregation strategies.
`include_transforms`	Restrict the candidate pool to models trained with these transformation types.
`exclude_transforms`	Exclude models trained with these transformation types (e.g. `["ILR", "ALR"]`).
`must_include_models`	Force specific model paths to always appear in the ensemble.
`use_calibrated`	Use probability-calibrated model variants when available.
`member_selection_strategies`	List of construction strategies to evaluate. When `None`, all strategies are tried. See Member selection strategies below.
`aggregation_strategies`	List of combination methods to evaluate. Prefix matching is supported (`"power_mean"` matches all variants). When `None`, all methods are tried. See Aggregation strategies below.
`compute_oracle`	When `True`, also computes the oracle upper bound: the best score achievable by any `(K, method)` combination over all models. Useful as a ceiling reference.
`quiet`	Suppress progress output.
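
To give a sense of the search space, the sketch below enumerates a grid over the default `pool_sizes` and `diversity_weights` for three strategies that take both parameters. It only illustrates the combinatorics: the dictionary keys and the choice of strategies are assumptions, and it does not reproduce how `EnsembleSweep` builds or prunes its search internally.

```python
from itertools import product

# Defaults taken from the parameter table above.
pool_sizes = [50, 100, 200]
diversity_weights = [0.1, 0.3, 0.5, 0.7]

# Illustrative subset of strategies that accept both a pool size and a
# diversity weight; the subset itself is arbitrary.
strategies = ["diverse", "diverse_families", "iterative"]

# One candidate configuration per (strategy, pool size, diversity weight).
grid = [
    {"strategy": s, "top_k_pool_size": k, "diversity_weight": w}
    for s, k, w in product(strategies, pool_sizes, diversity_weights)
]

print(len(grid))  # 3 strategies x 3 pool sizes x 4 weights = 36
```

Each aggregation strategy evaluated on top of these candidates multiplies the count again, which is why narrowing `member_selection_strategies` and `aggregation_strategies` can shorten a sweep considerably.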

Member selection strategies

Member selection determines which trained MPMAs are included in the ensemble.

Strategy	Parameters	Description
`top_k`	`max_models`	Sorts all models by their validation score and selects the top `max_models`. No diversity consideration.
`stratified`	`max_models`	Groups models into strata by `(transform_type, model_family)` and allocates an equal number of slots per stratum, choosing the best scorer per stratum. Remaining slots fill from the global ranking.
`diverse`	`top_k_pool_size`, `diversity_weight`, `max_models`	Pre-selects the top `top_k_pool_size` models, then selects greedily to maximise a combined score: `(1 − w) × performance + w × (1 − mean_abs_correlation)`, where `w` is `diversity_weight` and `mean_abs_correlation` is computed from the Pearson correlation of prediction vectors across all folds.
`diverse_families`	`top_k_pool_size`, `diversity_weight`, `max_models`	Same greedy framework as `diverse`, but the diversity term combines prediction correlation (50%) and algorithm-family novelty (50%): a model from a family not yet in the ensemble gets full family-novelty credit.
`clustered`	`top_k_pool_size`, `max_models`	Builds four separate candidate pools (top-K, diverse, stratified, clustering-based) and searches all of them. The clustering pool uses Ward-linkage hierarchical clustering of `1 −
`iterative`	`top_k_pool_size`, `diversity_weight`, `max_models`	Seeds an ensemble with `diverse` selection, then iteratively improves it: for each position, tries every other candidate as a replacement and keeps the swap if it raises the ensemble’s validation score. Repeats for up to 3 rounds or until no position improves.
`interpretable`	(none)	Restricts the candidate pool to tree-based and linear model families (RandomForest, ExtraTrees, GradientBoosting, HistGradientBoosting, LogisticRegression, DecisionTree, AdaBoost, Bagging, Ridge). Picks the single best per family, hard-capped at 3 models total.
`maximal_diversity`	`top_k_pool_size`, `max_models`	Maximises three-axis diversity: prediction disagreement (30%), algorithm-family novelty (40%), and transformation-type novelty (30%). Performance acts only as a 10% tiebreaker: `0.1 × performance + 0.9 × diversity`.
`borda_ranking`	`max_models`	Computes per-model ranks independently for several metrics (primary validation score, mean AUC, mean nMCC) and for their standard deviations (stability metrics ranked ascending). The final Borda score is the mean rank across all metrics. Models with the lowest average rank are selected.
`greedy_forward`	`top_k_pool_size`, `max_models`	Starts with the single model with the best per-fold AUC, then greedily adds the candidate whose addition to the current ensemble yields the largest AUC gain (using probability averaging). Stops when no candidate gives a positive gain.
`superlearner`	`min_models`, `max_models`	Trains a Lasso-regularised logistic regression meta-learner on the stacked per-fold probability outputs of up to 30 pre-selected models. Cross-validates over 9 regularisation strengths on the inner folds. Models whose absolute meta-learner coefficient exceeds 1% of the maximum are kept.
`shapley_value`	`top_k_pool_size`, `max_models`, `min_models`	Estimates each model’s marginal contribution via Monte-Carlo Shapley values (600 random coalition permutations, seed 42). For each permutation the metric gain from adding each model to its current coalition is accumulated. Models with Shapley value ≤ 0 are excluded.
`best_per_family`	`max_models`	Groups models by algorithm family and selects the single best-scoring representative per family, sorted by family score descending.
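
The combined score used by `diverse` can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name, the array layout, seeding with the best-scoring model, and averaging the correlation against the already-selected members are all assumptions.

```python
import numpy as np

def select_diverse(scores, preds, max_models, pool_size, w):
    """Greedy selection on (1 - w) * performance + w * (1 - mean |corr|).

    scores: (n_models,) validation scores, assumed to lie in [0, 1].
    preds:  (n_models, n_samples) stacked out-of-fold predictions.
    """
    scores = np.asarray(scores, dtype=float)
    # Pre-select the top `pool_size` models by score.
    pool = list(np.argsort(-scores)[:pool_size])
    # Absolute Pearson correlation between every pair of models.
    corr = np.abs(np.corrcoef(preds))
    # Seed with the best-scoring model (assumption: no diversity term yet).
    chosen = [pool.pop(0)]
    while pool and len(chosen) < max_models:
        def combined(i):
            mean_abs_corr = corr[i, chosen].mean()
            return (1 - w) * scores[i] + w * (1 - mean_abs_corr)
        best = max(pool, key=combined)
        pool.remove(best)
        chosen.append(best)
    return [int(i) for i in chosen]

rng = np.random.default_rng(0)
base = rng.normal(size=200)
preds = np.vstack([
    base,                                # model 0
    base + 0.01 * rng.normal(size=200),  # model 1: near-duplicate of model 0
    rng.normal(size=200),                # model 2: independent predictions
])
scores = [0.90, 0.89, 0.85]

print(select_diverse(scores, preds, max_models=2, pool_size=3, w=0.5))  # [0, 2]
print(select_diverse(scores, preds, max_models=2, pool_size=3, w=0.0))  # [0, 1]
```

With `w = 0.5` the near-duplicate model is skipped in favour of the weaker but uncorrelated one; with `w = 0` the selection reduces to plain `top_k`.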

Aggregation strategies

Aggregation determines how the selected members’ predictions are combined into a final ensemble prediction.

All probability-based strategies require `prob_models_only=True` (or calibrated models). When probabilities are not available, these strategies return `(None, None)`.

Vote-based

Strategy	Description
`voting`	Uniform majority vote. The most-predicted class wins. When probabilities are available, also returns the arithmetic mean probability vector.
`weighted_voting`	Each model’s vote is weighted by its min-max normalised validation score. The class with the highest total weight wins.
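
As an illustration of `weighted_voting`, the sketch below applies min-max normalisation to the validation scores and sums the weights per class. The function name, the input layout, and the equal-weight fallback when all scores tie are assumptions.

```python
import numpy as np

def weighted_vote(pred_labels, scores, n_classes):
    """pred_labels: (n_models, n_samples) integer class predictions."""
    pred_labels = np.asarray(pred_labels)
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    # Min-max normalisation; equal weights when all scores tie (assumption).
    weights = (scores - scores.min()) / span if span > 0 else np.ones_like(scores)
    totals = np.zeros((n_classes, pred_labels.shape[1]))
    for labels, wgt in zip(pred_labels, weights):
        # Add this model's weight to the class it voted for, per sample.
        totals[labels, np.arange(labels.size)] += wgt
    return totals.argmax(axis=0)

pred_labels = [
    [1, 0],  # model 0, score 0.9 -> weight 1.0
    [0, 0],  # model 1, score 0.7 -> weight 1/3
    [0, 1],  # model 2, score 0.6 -> weight 0.0
]
print(weighted_vote(pred_labels, [0.9, 0.7, 0.6], n_classes=2))  # [1 0]
```

Under min-max normalisation the lowest-scoring member receives weight 0, so on the first sample the best model outvotes the other two even though a uniform majority vote would pick class 0.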

Probability averaging

Strategy	Description
`probability_averaging`	Arithmetic mean of all members’ probability vectors (uniform weights).
`weighted_probability_averaging`	Exponential weighting: `w_i = exp(score_i / 0.1)` after min-shifting. High-scoring models dominate sharply.
`geometric_mean`	Weighted geometric mean in log-space: `exp(Σ w_i · log(p_i))`, renormalised. A single near-zero probability pulls the class probability down strongly.
`trimmed_mean`	Arithmetic mean after trimming the bottom and top 10% of probability values per class (only applied when ≥ 4 models are present).
`median_probability`	Element-wise median across all members’ probability vectors, renormalised to sum to 1. Has a 50% breakdown point: fewer than half the models can be outliers without pulling the result arbitrarily far.
`power_mean_p-1`	Weighted harmonic mean (`p = −1`): `(Σ w_i / p_i)^{−1}`. Conservative combiner: a single model assigning near-zero probability to a class suppresses that class strongly.
`power_mean_p2`	Weighted quadratic mean (`p = 2`): `sqrt(Σ w_i · p_i^2)`. Members with high class probabilities dominate more than in arithmetic mean.
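
The geometric, harmonic, and quadratic combiners are all members of the weighted power-mean family, which the following sketch makes explicit. The function name, the clipping constant, and renormalising every variant (the table states it only for `geometric_mean`) are assumptions.

```python
import numpy as np

def power_mean(probs, weights, p):
    """probs: (n_models, n_classes); weights sum to 1; p = 0 means geometric."""
    # Clip to avoid log(0) and division by zero (epsilon is an assumption).
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    w = np.asarray(weights, dtype=float)[:, None]
    if p == 0:
        combined = np.exp((w * np.log(probs)).sum(axis=0))
    else:
        combined = (w * probs ** p).sum(axis=0) ** (1.0 / p)
    return combined / combined.sum()  # renormalise to sum to 1

probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.01, 0.99],  # one member strongly doubts class 0
]
weights = [1 / 3, 1 / 3, 1 / 3]

for p in (-1, 0, 1, 2):
    print(p, power_mean(probs, weights, p))
```

In this example the single dissenting member drags the class-0 probability down far more under `p = −1` and `p = 0` than under the arithmetic mean (`p = 1`), which matches the "conservative combiner" description above.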

Stacking

Strategy	Description
`stacking_lr`	Trains a logistic regression meta-learner on the stacked probability outputs from inner-validation folds. Uses L2 regularisation, cross-validated over `C ∈ {0.01, 0.1, 1, 10}`.
`hierarchical_stacking`	Pseudo-stacking: normalises member scores to `[0,1]`, creates binary pseudo-labels (`score > median`), and fits a penalty-free logistic regression on those labels. Member weights are proportional to the fitted coefficient times the normalised score.
`bayesian_model_averaging`	Posterior weights via Bayes rule with a uniform prior: `log_w_i ∝ precision × normalised_score`, where `precision = α × n_members` (α = 1.0). Weights computed via log-sum-exp for numerical stability.
`superlearner`	Learns an optimal linear combination of member probabilities using a regularised logistic regression meta-learner (see selection strategy of the same name, which is used here as a combiner rather than a selector).
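
The weight computation described for `bayesian_model_averaging` amounts to a softmax over `precision × normalised_score`. The sketch below shows one reading of it; the function name and the use of min-max normalisation for the scores are assumptions.

```python
import numpy as np

def bma_weights(scores, alpha=1.0):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    # Min-max normalisation of scores (the exact normalisation is an assumption).
    norm = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
    precision = alpha * scores.size
    log_w = precision * norm  # a uniform prior only adds a constant
    # Log-sum-exp: subtract the maximum before exponentiating to avoid overflow.
    shifted = log_w - log_w.max()
    log_norm = np.log(np.exp(shifted).sum())
    return np.exp(shifted - log_norm)

w = bma_weights([0.9, 0.8, 0.7])
print(w, w.sum())
```

Because `precision` grows with the number of members, larger ensembles concentrate more weight on the top-scoring models under this reading.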

Sample-adaptive

Strategy	Description
`confidence_weighted`	Base weights from raw validation scores. Per-sample, per-model confidence is `max_prob × (1 − entropy / log(n_classes))`. Each model’s base weight is boosted by `1 + confidence`.
`temperature_scaled`	Assigns a per-model temperature based on score bucket: ≥ 0.9 gives T = 0.8, ≥ 0.8 gives T = 1.0, ≥ 0.7 gives T = 1.2, lower gives T = 1.5. Probabilities are sharpened or softened via `softmax(log(p) / T)` before weighted averaging.
`adaptive_ensemble`	Uses per-sample entropy to estimate difficulty. On hard samples (`normalised_entropy > 0.5`), models with low difficulty (consistently high scores) receive an upward weight adjustment. Rate controlled by `adaptation_rate = 0.1`.
`uncertainty_aware`	Per-sample, per-model prediction uncertainty is `0.6 × (1 − confidence_margin) + 0.4 × normalised_entropy`. Models whose uncertainty exceeds a threshold (0.3) have their weight multiplied by `(1 − prediction_uncertainty) × (1 − model_uncertainty)`.
`dynamic_selection`	Per-sample, each model’s competence is `(0.5 × max_prob + 0.5 × margin) × base_score`. Only models with competence ≥ 0.6 contribute. If fewer than 2 qualify, the top-2 by competence are forced in.
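
For `temperature_scaled`, the bucket lookup and the `softmax(log(p) / T)` rescaling can be sketched as below. The function names and the clipping constant are assumptions; the thresholds and temperatures are the ones listed in the table.

```python
import numpy as np

def temperature_for(score):
    """Score buckets from the table: higher-scoring models get a lower T."""
    if score >= 0.9:
        return 0.8
    if score >= 0.8:
        return 1.0
    if score >= 0.7:
        return 1.2
    return 1.5

def rescale(probs, T):
    """softmax(log(p) / T), computed stably; T < 1 sharpens, T > 1 softens."""
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / T
    logits -= logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

p = [0.7, 0.3]
for score in (0.95, 0.85, 0.75, 0.50):
    T = temperature_for(score)
    print(score, T, rescale(p, T))
```

Since `softmax(log(p) / T)` equals `p^(1/T)` renormalised, `T = 1.0` leaves the vector unchanged, while strong models (`T = 0.8`) become more decisive and weak ones (`T = 1.5`) are flattened before the weighted average.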

Robust / adversarial

Strategy	Description
`robust_consensus`	Groups models by their predicted class. A class reaches consensus when its consensus score `Σ_weights × n_agreers / n_total` exceeds 0.7 and at least 3 models agree on it; the class with the highest consensus score is selected. Final probability is the weighted average over agreeing models only. Falls back to global weighted average when no consensus is reached.
`negative_correlation`	Per-sample, models that disagree most from the ensemble mean receive a diversity bonus: `adjusted_weight = base_weight × (1 + λ × mean
`trimmed_mean`	See Probability averaging above.

Rank / tournament

Strategy	Description
`rank_aggregation`	Borda-count aggregation over probability vectors: for each model, classes are ranked by probability (0 = lowest). Weighted rank totals across models determine the final class order.
`copeland`	Pairwise weighted tournament over classes: for each pair of classes, the model’s weight counts towards whichever class has higher probability. Copeland score = number of pairwise wins (+0.5 for ties). Only the pairwise ordering of class probabilities matters, not their magnitudes.
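
The rank totals used by `rank_aggregation` can be sketched for a single sample as follows. The function name and input layout are assumptions, and ties within a model are broken by `argsort` order here, which may differ from the actual implementation.

```python
import numpy as np

def borda_aggregate(probs, weights):
    """probs: (n_models, n_classes) for one sample. Returns weighted rank totals."""
    probs = np.asarray(probs, dtype=float)
    # argsort of argsort gives each class's rank within a model (0 = lowest).
    ranks = probs.argsort(axis=1).argsort(axis=1)
    return (np.asarray(weights, dtype=float)[:, None] * ranks).sum(axis=0)

probs = [
    [0.5, 0.3, 0.2],  # ranks: [2, 1, 0]
    [0.1, 0.6, 0.3],  # ranks: [0, 2, 1]
    [0.2, 0.5, 0.3],  # ranks: [0, 2, 1]
]
totals = borda_aggregate(probs, weights=[1.0, 1.0, 1.0])
print(totals)           # [2. 5. 2.]
print(totals.argmax())  # 1
```

Only the ordering of classes within each model enters the totals, so a member that is extremely confident counts no more than one that is marginally confident.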

Other

Strategy	Description
`switch`	Per-sample, selects the single most confident model (highest `max_prob`) and uses its prediction and probability vector directly. No averaging.
`minimax`	Per-sample, scales each model’s probability vector by its weight, then takes the element-wise minimum across all models. Equivalent to a unanimous-consent combiner: any model that strongly doubts a class suppresses it.
`taxonomic_aware`	Assigns weights based on the taxonomic resolution in the model path. Higher-resolution models (genus, species) receive larger weights. Path keywords (strain/species/…/domain) are detected automatically.
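
As a final illustration, `switch` can be sketched with a few array operations. The function name and the `(n_models, n_samples, n_classes)` layout are assumptions.

```python
import numpy as np

def switch(probs):
    """probs: (n_models, n_samples, n_classes). Returns (labels, chosen_probs)."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=2)      # (n_models, n_samples)
    winner = confidence.argmax(axis=0)  # most confident model per sample
    # Pick, for every sample, the probability vector of its winning model.
    chosen = probs[winner, np.arange(probs.shape[1])]
    return chosen.argmax(axis=1), chosen

probs = [
    [[0.90, 0.10], [0.55, 0.45]],  # model 0
    [[0.60, 0.40], [0.20, 0.80]],  # model 1
]
labels, chosen = switch(probs)
print(labels)  # [0 1]
print(chosen)
```

Model 0 decides the first sample and model 1 the second, so the output probabilities are copied from a different member per sample rather than averaged.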