Microbiome profile modelling algorithm (MPMA)

An MPMA is a complete machine learning pipeline: an MPDR (taxonomic resolution + transformation + optional projection) combined with a base learner.

Available base learners

Base learners are passed as instances of their config class. Constructor arguments override the default hyperparameter values. For example:


import mllabiome as mll
 
mll.XGBoost(n_estimators=1000)
mll.RandomForestClassifier(n_estimators=500, min_samples_leaf=3)
mll.BernoulliNB()

operations/learners_configs.py provides a ready-to-use list of all available base learners with default parameters. The full set is organised by family below. Classes whose name ends in Classifier are classification-only; classes ending in Regressor are regression-only; unmarked classes support both tasks unless noted.

Linear models

Ridge · Lasso · ElasticNet · SGDRegressor · PassiveAggressiveRegressor · HuberRegressor · LogisticRegression · RidgeClassifier · RidgeClassifierCV · SGDClassifier · PassiveAggressiveClassifier

Tree-based models

DecisionTreeRegressor · DecisionTreeClassifier · RandomForestRegressor · RandomForestClassifier · ExtraTreesRegressor · ExtraTreesClassifier · GradientBoostingRegressor · GradientBoostingClassifier · HistGradientBoostingRegressor · HistGradientBoostingClassifier

Boosting models

XGBoost · LightGBM · CatBoost · AdaBoostRegressor · AdaBoostClassifier

SVM models

SVR · SVC · NuSVR · NuSVC · LinearSVR · LinearSVC

Neighbor models

KNeighborsRegressor · KNeighborsClassifier · NearestCentroid

Naive Bayes models

GaussianNB · BernoulliNB

Discriminant analysis models

LinearDiscriminantAnalysis · QuadraticDiscriminantAnalysis

Ensemble method models

BaggingRegressor · BaggingClassifier

Neural network models

MLPRegressor · MLPClassifier · SimpleNN · WideNN · TabPFN · TabICL · SCARF · DeepMicroAE · DeepMicroVAE · DeepMicroCAE · PhyloFormer

Probabilistic and mixture models

GaussianNB · GMMClassifier · BayesianGaussianMixtureClassifier · FactorAnalysis (unsupervised)

Semi-supervised and AutoML

SelfTraining · FLAMLClassifier · TabPFNClassifier · TabICLClassifier

Prompting

OllamaClassifier

Clustering models

DBSCAN · HDBSCAN · MeanShift · Birch · AgglomerativeClustering · SpectralClustering

Custom models

New models can be added at runtime using the @register_model decorator. Once registered, a custom model is available through ExperimentConfiguration and participates in hyperparameter optimisation alongside built-in learners.

Registering a scikit-learn compatible model

Subclass BaseEstimator and ClassifierMixin (or RegressorMixin), then decorate the class:


import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from mllabiome import register_model
 
@register_model(
    name="WeightedKNN",
    task_types=["binary", "multiclass"],
    params={
        "n_neighbors": {"type": "int", "range": [3, 30], "default": 7},
        "weight_power": {"type": "float", "range": [0.5, 3.0], "default": 1.0},
    },
)
class WeightedKNN(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=7, weight_power=1.0, random_state=42):
        self.n_neighbors = n_neighbors
        self.weight_power = weight_power
        self.random_state = random_state
 
    def fit(self, X, y):
        # Encode labels as 0..n_classes-1 so string or non-contiguous labels work.
        self.classes_, self.y_train_ = np.unique(np.asarray(y), return_inverse=True)
        self.X_train_ = np.asarray(X)
        return self

    def predict(self, X):
        # Unweighted majority vote among the nearest neighbours,
        # mapped back to the original class labels.
        from sklearn.metrics import pairwise_distances

        dists = pairwise_distances(X, self.X_train_)
        idx = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        neighbor_labels = self.y_train_[idx]
        votes = np.array(
            [np.bincount(row, minlength=len(self.classes_)).argmax() for row in neighbor_labels]
        )
        return self.classes_[votes]

    def predict_proba(self, X):
        # Distance-weighted class probabilities; columns follow self.classes_.
        from sklearn.metrics import pairwise_distances

        dists = pairwise_distances(X, self.X_train_)
        idx = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        weights = 1.0 / (dists[np.arange(len(X))[:, None], idx] ** self.weight_power + 1e-10)
        neighbor_labels = self.y_train_[idx]
        proba = np.zeros((len(X), len(self.classes_)))
        for c_idx in range(len(self.classes_)):
            proba[:, c_idx] = (weights * (neighbor_labels == c_idx)).sum(axis=1)
        return proba / proba.sum(axis=1, keepdims=True)

Key requirements:

Store every constructor parameter as an attribute with the same name (scikit-learn convention).
Set self.classes_ during fit.
Return self from fit.
Implement predict_proba for classifiers used in ensemble voting or calibration.
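
The first requirement is easy to break by renaming an attribute. scikit-learn's get_params() (and therefore clone()) reads each constructor argument back through an attribute of the same name, so a quick check like the one below can catch the mistake before an experiment runs. This is an illustrative sketch using only the standard library; missing_init_attributes, GoodModel, and BadModel are hypothetical names, not part of mllabiome or scikit-learn.

```python
import inspect


def missing_init_attributes(estimator):
    """Return constructor parameter names that are not stored as same-name attributes."""
    signature = inspect.signature(type(estimator).__init__)
    missing = []
    for name, param in signature.parameters.items():
        # Skip `self` and catch-all *args / **kwargs parameters.
        if name == "self" or param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if not hasattr(estimator, name):
            missing.append(name)
    return missing


class GoodModel:
    def __init__(self, n_neighbors=7, weight_power=1.0):
        self.n_neighbors = n_neighbors
        self.weight_power = weight_power


class BadModel:
    def __init__(self, n_neighbors=7):
        self.k = n_neighbors  # renamed attribute: violates the convention


good_report = missing_init_attributes(GoodModel())
bad_report = missing_init_attributes(BadModel())
```

An empty list means every constructor parameter can be read back by name; any name in the list points at an attribute that was renamed or never stored.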

Decorator parameters

Parameter	Type	Default	Description
`name`	`str`	required	Unique identifier used in experiment configs.
`task_types`	`list[str]`	all	Supported tasks: `"binary"`, `"multiclass"`, `"regression"`, `"multilabel"`.
`params`	`dict`	`{}`	Hyperparameter search space for optimisation.
`framework`	`str`	`"sklearn"`	`"sklearn"`, `"pytorch"`, or `"xgboost"`.
`constraints`	`list[dict]`	`[]`	Parameter constraints (see below).
`description`	`str`	`""`	Human-readable summary.

Hyperparameter types

Type	Format	Example
`"int"`	`{"type": "int", "range": [lo, hi], "default": val}`	`{"type": "int", "range": [10, 200], "default": 100}`
`"float"`	`{"type": "float", "range": [lo, hi], "default": val}`	`{"type": "float", "range": [0.0, 1.0], "default": 0.5}`
`"loguniform"`	`{"type": "loguniform", "range": [lo, hi], "default": val}`	`{"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3}`
`"categorical"`	`{"type": "categorical", "choices": [...], "default": val}`	`{"type": "categorical", "choices": ["l1", "l2"], "default": "l2"}`
`"boolean"`	`{"type": "boolean", "default": val}`	`{"type": "boolean", "default": True}`
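
One plausible way an optimiser could read these specifications is sketched below. It is not mllabiome's sampler: sample_param, sample_space, and default_params are hypothetical helpers, and treating range bounds as inclusive and booleans as a fair coin flip are assumptions made for illustration.

```python
import math
import random


def sample_param(spec, rng):
    """Draw one value from a single hyperparameter specification."""
    kind = spec["type"]
    if kind == "int":
        lo, hi = spec["range"]
        return rng.randint(lo, hi)  # assumption: both bounds inclusive
    if kind == "float":
        lo, hi = spec["range"]
        return rng.uniform(lo, hi)
    if kind == "loguniform":
        lo, hi = spec["range"]
        # Uniform in log space, then clamped to absorb floating-point rounding.
        value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        return min(max(value, lo), hi)
    if kind == "categorical":
        return rng.choice(spec["choices"])
    if kind == "boolean":
        return rng.random() < 0.5
    raise ValueError(f"unknown hyperparameter type: {kind!r}")


def sample_space(space, rng):
    """Draw one full configuration from a search space."""
    return {name: sample_param(spec, rng) for name, spec in space.items()}


def default_params(space):
    """Collect the declared default of every hyperparameter."""
    return {name: spec["default"] for name, spec in space.items()}


space = {
    "n_estimators": {"type": "int", "range": [10, 200], "default": 100},
    "subsample": {"type": "float", "range": [0.0, 1.0], "default": 0.5},
    "learning_rate": {"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3},
    "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
    "bootstrap": {"type": "boolean", "default": True},
}

rng = random.Random(0)  # seeded for reproducibility
candidate = sample_space(space, rng)
```

The "loguniform" branch samples uniformly in log space, which spreads trials evenly across orders of magnitude. That is the usual reason to prefer it over "float" for quantities such as learning rates.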

Parameter constraints

Constraints prevent invalid parameter combinations during hyperparameter search:


from sklearn.base import BaseEstimator, ClassifierMixin
from mllabiome import register_model

@register_model(
    name="MySVM",
    task_types=["binary", "multiclass"],
    params={
        "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
        "solver": {"type": "categorical", "choices": ["lbfgs", "saga"], "default": "lbfgs"},
    },
    constraints=[
        {
            "description": "l1 penalty requires saga solver",
            "check": lambda p: p.get("penalty") != "l1" or p.get("solver") == "saga",
        }
    ],
)
class MySVM(BaseEstimator, ClassifierMixin):
    ...
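
How a search procedure might apply such a constraint can be sketched with plain rejection sampling: draw a candidate, evaluate every check, and redraw if any fails. The helpers violated_constraints and sample_valid are hypothetical, and whether mllabiome rejects, repairs, or prunes invalid candidates is not stated here, so treat this only as an illustration of what the check callable expresses.

```python
import random

constraints = [
    {
        "description": "l1 penalty requires saga solver",
        "check": lambda p: p.get("penalty") != "l1" or p.get("solver") == "saga",
    }
]


def violated_constraints(params, constraints):
    """Return the descriptions of all constraints that `params` fails."""
    return [c["description"] for c in constraints if not c["check"](params)]


def sample_valid(choices, constraints, rng, max_tries=100):
    """Draw categorical candidates until one satisfies every constraint."""
    for _ in range(max_tries):
        candidate = {name: rng.choice(options) for name, options in choices.items()}
        if not violated_constraints(candidate, constraints):
            return candidate
    raise RuntimeError("no valid configuration found within max_tries")


choices = {"penalty": ["l1", "l2"], "solver": ["lbfgs", "saga"]}
rng = random.Random(0)  # seeded for reproducibility
valid_candidate = sample_valid(choices, constraints, rng)
```

The check is written as "not l1, or saga", the standard way to encode the implication "l1 implies saga" as a single boolean expression.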

PyTorch models

For deep learning models, extend PyTorchBase and set framework="pytorch":


import torch.nn as nn
from sklearn.base import ClassifierMixin
from mllabiome import register_model
from mllabiome.ai_space.dl_subspace.custom import PyTorchBase
 
@register_model(
    name="TwoLayerNet",
    task_types=["binary", "multiclass"],
    params={
        "hidden_dim": {"type": "int", "range": [32, 256], "default": 128},
        "learning_rate": {"type": "loguniform", "range": [1e-4, 1e-2], "default": 1e-3},
        "epochs": {"type": "int", "range": [10, 100], "default": 50},
    },
    framework="pytorch",
)
class TwoLayerNet(PyTorchBase, ClassifierMixin):
    def __init__(self, hidden_dim=128, learning_rate=1e-3, epochs=50, random_state=42, **kwargs):
        super().__init__(
            hidden_dim=hidden_dim,
            learning_rate=learning_rate,
            epochs=epochs,
            random_state=random_state,
            **kwargs,
        )
 
    def _build_model(self, n_features, n_classes):
        return nn.Sequential(
            nn.Linear(n_features, self.hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(self.hidden_dim, n_classes),
        ).to(self.device)

Using a custom model in an experiment

Import the registration module before the experiment runs so the decorator executes:


# my_models.py — defines and registers the model (as above)
 
# experiment.py
import my_models  # triggers registration
import mllabiome as mll
 
config = mll.ExperimentConfiguration(
    ...
    models=[mll.XGBoost(n_estimators=1000), "WeightedKNN"],
    ...
)

Custom and built-in models can be mixed freely in the same experiment.

Listing registered models


from mllabiome import get_all_models, get_classification_models, get_model_config
 
print(get_all_models())              # all registered names
print(get_classification_models())   # classification only
 
config = get_model_config("WeightedKNN")
print(config.params)                 # hyperparameter space