Microbiome profile modelling algorithm (MPMA)

An MPMA is a complete machine learning pipeline: an MPDR (taxonomic resolution + transformation + optional projection) combined with a base learner.

Available base learners

Base learners are passed as instances of their config class. Constructor arguments override the default hyperparameter values. For example:

```python
import mllabiome as mll

mll.XGBoost(n_estimators=1000)
mll.RandomForestClassifier(n_estimators=500, min_samples_leaf=3)
mll.BernoulliNB()
```

operations/learners_configs.py provides a ready-to-use list of all available base learners with default parameters. The full set is organised by family below. Classes whose name ends in Classifier are classification-only; classes ending in Regressor are regression-only; unmarked classes support both tasks unless noted.

Linear models

Ridge · Lasso · ElasticNet · SGDRegressor · PassiveAggressiveRegressor · HuberRegressor · LogisticRegression · RidgeClassifier · RidgeClassifierCV · SGDClassifier · PassiveAggressiveClassifier

Tree-based models

DecisionTreeRegressor · DecisionTreeClassifier · RandomForestRegressor · RandomForestClassifier · ExtraTreesRegressor · ExtraTreesClassifier · GradientBoostingRegressor · GradientBoostingClassifier · HistGradientBoostingRegressor · HistGradientBoostingClassifier

Boosting models

XGBoost · LightGBM · CatBoost · AdaBoostRegressor · AdaBoostClassifier

SVM models

SVR · SVC · NuSVR · NuSVC · LinearSVR · LinearSVC

Neighbor models

KNeighborsRegressor · KNeighborsClassifier · NearestCentroid

Naive Bayes models

GaussianNB · BernoulliNB

Discriminant analysis models

LinearDiscriminantAnalysis · QuadraticDiscriminantAnalysis

Ensemble method models

BaggingRegressor · BaggingClassifier

Neural network models

MLPRegressor · MLPClassifier · SimpleNN · WideNN · TabPFN · TabICL · SCARF · DeepMicroAE · DeepMicroVAE · DeepMicroCAE · PhyloFormer

Probabilistic and mixture models

GaussianNB · GMMClassifier · BayesianGaussianMixtureClassifier · FactorAnalysis (unsupervised)

Semi-supervised and AutoML

SelfTraining · FLAMLClassifier · TabPFNClassifier · TabICLClassifier

Prompting

OllamaClassifier

Clustering models

DBSCAN · HDBSCAN · MeanShift · Birch · AgglomerativeClustering · SpectralClustering

Custom models

New models can be added at runtime using the @register_model decorator. Once registered, a custom model is available through ExperimentConfiguration and participates in hyperparameter optimisation alongside built-in learners.
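To make the registration mechanism concrete, here is a minimal sketch of how a decorator-based registry can work in general. It is not mllabiome's implementation: `MODEL_REGISTRY`, `register`, and `ConstantModel` are hypothetical names used only for illustration. The point it shows is that registration happens as a side effect of the class statement being executed, which is why the module defining a custom model has to be imported before the model's name can be resolved.

```python
# Hypothetical names throughout: MODEL_REGISTRY and register are illustrative,
# not mllabiome internals.
MODEL_REGISTRY = {}


def register(name, task_types=None, params=None):
    """Return a class decorator that records the class under `name`."""

    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model name already registered: {name!r}")
        MODEL_REGISTRY[name] = {
            "cls": cls,
            "task_types": list(task_types or []),
            "params": dict(params or {}),
        }
        return cls  # the class itself is returned unchanged

    return decorator


@register(
    name="ConstantModel",
    task_types=["binary"],
    params={"value": {"type": "int", "range": [0, 1], "default": 0}},
)
class ConstantModel:
    def __init__(self, value=0):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


# Registration happened when the class statement ran, so the name resolves now.
entry = MODEL_REGISTRY["ConstantModel"]
defaults = {k: spec["default"] for k, spec in entry["params"].items()}
model = entry["cls"](**defaults)
print(model.predict([[0.1], [0.2]]))  # [0, 0]
```

Looking a model up by its string name and instantiating it from the declared defaults is one plausible way a framework could turn a registered name into a learner; the real lookup path may differ.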

Registering a scikit-learn compatible model

Subclass BaseEstimator and ClassifierMixin (or RegressorMixin), then decorate the class:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import pairwise_distances

from mllabiome import register_model


@register_model(
    name="WeightedKNN",
    task_types=["binary", "multiclass"],
    params={
        "n_neighbors": {"type": "int", "range": [3, 30], "default": 7},
        "weight_power": {"type": "float", "range": [0.5, 3.0], "default": 1.0},
    },
)
class WeightedKNN(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=7, weight_power=1.0, random_state=42):
        self.n_neighbors = n_neighbors
        self.weight_power = weight_power
        self.random_state = random_state

    def fit(self, X, y):
        # Convert to arrays so the index-based lookups below also work
        # for lists and pandas inputs.
        self.X_train_ = np.asarray(X)
        self.y_train_ = np.asarray(y)
        self.classes_ = np.unique(self.y_train_)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        dists = pairwise_distances(X, self.X_train_)
        # Indices of the n_neighbors closest training samples for each row of X.
        idx = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        # Inverse-distance weights; the small constant avoids division by zero.
        neighbor_dists = np.take_along_axis(dists, idx, axis=1)
        weights = 1.0 / (neighbor_dists ** self.weight_power + 1e-10)
        proba = np.zeros((len(X), len(self.classes_)))
        for c_idx, c in enumerate(self.classes_):
            mask = self.y_train_[idx] == c
            proba[:, c_idx] = (weights * mask).sum(axis=1)
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        # Map the highest-probability column back to the original class label,
        # so string or non-contiguous integer labels are returned correctly.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

Key requirements:

  • Store every constructor parameter as an attribute with the same name (scikit-learn convention).
  • Set self.classes_ during fit (classifiers only).
  • Return self from fit.
  • Implement predict_proba for classifiers used in ensemble voting or calibration.
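The first requirement matters because scikit-learn's `get_params` (and therefore `clone`) reads each constructor parameter back from an attribute of the same name. The following is a small standard-library sketch for spotting violations before registering a model. The helper `unstored_init_params` and the classes `Good` and `Bad` are hypothetical, written for this example only; they are not part of mllabiome or scikit-learn.

```python
import inspect


def unstored_init_params(estimator):
    """Return names of __init__ parameters not stored as same-named attributes."""
    signature = inspect.signature(type(estimator).__init__)
    missing = []
    for name, param in signature.parameters.items():
        # Skip self and catch-all *args / **kwargs entries.
        if name == "self" or param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if not hasattr(estimator, name):
            missing.append(name)
    return missing


class Good:
    def __init__(self, n_neighbors=7, weight_power=1.0):
        self.n_neighbors = n_neighbors
        self.weight_power = weight_power

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self


class Bad:
    def __init__(self, n_neighbors=7):
        self.k = n_neighbors  # stored under a different name


print(unstored_init_params(Good()))  # []
print(unstored_init_params(Bad()))   # ['n_neighbors']
```

The check only confirms that an attribute with the right name exists; it does not verify that the stored value is the unmodified constructor argument, which the scikit-learn convention also expects.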

Decorator parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Unique identifier used in experiment configs. |
| task_types | list[str] | all | Supported tasks: "binary", "multiclass", "regression", "multilabel". |
| params | dict | {} | Hyperparameter search space for optimisation. |
| framework | str | "sklearn" | "sklearn", "pytorch", or "xgboost". |
| constraints | list[dict] | [] | Parameter constraints (see below). |
| description | str | "" | Human-readable summary. |

Hyperparameter types

| Type | Format | Example |
| --- | --- | --- |
| "int" | {"type": "int", "range": [lo, hi], "default": val} | {"type": "int", "range": [10, 200], "default": 100} |
| "float" | {"type": "float", "range": [lo, hi], "default": val} | {"type": "float", "range": [0.0, 1.0], "default": 0.5} |
| "loguniform" | {"type": "loguniform", "range": [lo, hi], "default": val} | {"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3} |
| "categorical" | {"type": "categorical", "choices": [...], "default": val} | {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"} |
| "boolean" | {"type": "boolean", "default": val} | {"type": "boolean", "default": True} |

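To illustrate what each hyperparameter type means for a search, the sketch below draws one random value from a spec dict of each kind. It is an assumption about how such specs could be interpreted, not the sampler mllabiome uses: `sample_param` is a hypothetical helper, and treating both ends of an "int" range as inclusive is a choice made for this example.

```python
import math
import random


def sample_param(spec, rng):
    """Draw one value from a hyperparameter spec dict (hypothetical helper)."""
    kind = spec["type"]
    if kind == "int":
        lo, hi = spec["range"]
        return rng.randint(lo, hi)  # both ends inclusive (assumption)
    if kind == "float":
        lo, hi = spec["range"]
        return rng.uniform(lo, hi)
    if kind == "loguniform":
        # Uniform in log space, so each order of magnitude is equally likely.
        lo, hi = spec["range"]
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))
    if kind == "categorical":
        return rng.choice(spec["choices"])
    if kind == "boolean":
        return rng.choice([True, False])
    raise ValueError(f"unknown hyperparameter type: {kind!r}")


space = {
    "n_estimators": {"type": "int", "range": [10, 200], "default": 100},
    "subsample": {"type": "float", "range": [0.0, 1.0], "default": 0.5},
    "learning_rate": {"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3},
    "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
    "bootstrap": {"type": "boolean", "default": True},
}

rng = random.Random(0)  # seeded for reproducibility
candidate = {name: sample_param(spec, rng) for name, spec in space.items()}
print(candidate)
```

"loguniform" is the natural choice for parameters such as learning rates, where the difference between 1e-4 and 1e-3 matters as much as the difference between 1e-2 and 1e-1.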
Parameter constraints

Constraints prevent invalid parameter combinations during hyperparameter search:

```python
from sklearn.base import BaseEstimator, ClassifierMixin

from mllabiome import register_model


@register_model(
    name="MySVM",
    task_types=["binary", "multiclass"],
    params={
        "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
        "solver": {"type": "categorical", "choices": ["lbfgs", "saga"], "default": "lbfgs"},
    },
    constraints=[
        {
            "description": "l1 penalty requires saga solver",
            # Returns True when the parameter combination is valid.
            "check": lambda p: p.get("penalty") != "l1" or p.get("solver") == "saga",
        }
    ],
)
class MySVM(BaseEstimator, ClassifierMixin):
    ...
```
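The effect of a constraint can be seen by enumerating the candidate combinations and keeping only those for which every `check` returns True. The sketch below reuses the constraint from the example above; the helper `violated` and the enumeration with `itertools.product` are illustrative assumptions, not how mllabiome's optimiser applies constraints internally.

```python
import itertools

constraints = [
    {
        "description": "l1 penalty requires saga solver",
        "check": lambda p: p.get("penalty") != "l1" or p.get("solver") == "saga",
    }
]


def violated(params, constraints):
    """Return descriptions of the constraints that `params` fails (hypothetical helper)."""
    return [c["description"] for c in constraints if not c["check"](params)]


choices = {"penalty": ["l1", "l2"], "solver": ["lbfgs", "saga"]}
names = list(choices)

# Every combination of the categorical choices.
candidates = [dict(zip(names, combo)) for combo in itertools.product(*choices.values())]

# Keep only combinations that violate no constraint.
valid = [p for p in candidates if not violated(p, constraints)]

for p in candidates:
    problems = violated(p, constraints)
    print(p, "->", "ok" if not problems else problems)
```

Of the four combinations, only penalty "l1" with solver "lbfgs" is rejected; the other three pass the check.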

PyTorch models

For deep learning models, extend PyTorchBase and set framework="pytorch":

```python
import torch.nn as nn
from sklearn.base import ClassifierMixin

from mllabiome import register_model
from mllabiome.ai_space.dl_subspace.custom import PyTorchBase


@register_model(
    name="TwoLayerNet",
    task_types=["binary", "multiclass"],
    params={
        "hidden_dim": {"type": "int", "range": [32, 256], "default": 128},
        "learning_rate": {"type": "loguniform", "range": [1e-4, 1e-2], "default": 1e-3},
        "epochs": {"type": "int", "range": [10, 100], "default": 50},
    },
    framework="pytorch",
)
class TwoLayerNet(PyTorchBase, ClassifierMixin):
    def __init__(self, hidden_dim=128, learning_rate=1e-3, epochs=50, random_state=42, **kwargs):
        super().__init__(
            hidden_dim=hidden_dim,
            learning_rate=learning_rate,
            epochs=epochs,
            random_state=random_state,
            **kwargs,
        )

    def _build_model(self, n_features, n_classes):
        # One hidden layer with ReLU and dropout, followed by the output layer.
        return nn.Sequential(
            nn.Linear(n_features, self.hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(self.hidden_dim, n_classes),
        ).to(self.device)
```

Using a custom model in an experiment

Import the registration module before the experiment runs so the decorator executes:

```python
# my_models.py: defines and registers the model (as above)

# experiment.py
import my_models  # triggers registration
import mllabiome as mll

config = mll.ExperimentConfiguration(
    ...
    models=[mll.XGBoost(n_estimators=1000), "WeightedKNN"],
    ...
)
```

Custom and built-in models can be mixed freely in the same experiment.

Listing registered models

```python
from mllabiome import get_all_models, get_classification_models, get_model_config

print(get_all_models())             # all registered names
print(get_classification_models())  # classification only

config = get_model_config("WeightedKNN")
print(config.params)                # hyperparameter space
```