Microbiome profile modelling algorithm (MPMA)
An MPMA is a complete machine learning pipeline: an MPDR (taxonomic resolution + transformation + optional projection) combined with a base learner.
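The shape of such a pipeline can be illustrated with plain scikit-learn. The sketch below is an analogy, not mllabiome's implementation: it assumes the counts are already aggregated at a fixed taxonomic rank, and uses a hand-written centred log-ratio function (`clr`, a hypothetical helper), PCA as the optional projection, and logistic regression as the base learner.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform applied row-wise to a count matrix."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


# Transformation -> optional projection -> base learner.
pipeline = Pipeline(
    [
        ("transform", FunctionTransformer(clr)),
        ("project", PCA(n_components=5, random_state=0)),
        ("learner", LogisticRegression(max_iter=1000)),
    ]
)

# Synthetic counts: 60 samples x 20 taxa, two balanced classes.
rng = np.random.default_rng(0)
X = rng.poisson(lam=20, size=(60, 20))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 30  # class 1 is enriched in the first five taxa

pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Swapping the last step for a different estimator changes the base learner while leaving the representation steps untouched, which is the separation the MPDR/base-learner split describes.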
Available base learners
Base learners are passed as instances of their config class. Constructor arguments override the default hyperparameter values. For example:
```python
import mllabiome as mll

mll.XGBoost(n_estimators=1000)
mll.RandomForestClassifier(n_estimators=500, min_samples_leaf=3)
mll.BernoulliNB()
```
`operations/learners_configs.py` provides a ready-to-use list of all available base learners with default parameters. The full set is organised by family below. Classes whose name ends in `Classifier` are classification-only; classes ending in `Regressor` are regression-only; unmarked classes support both tasks unless noted.
Linear models
Ridge · Lasso · ElasticNet · SGDRegressor · PassiveAggressiveRegressor · HuberRegressor · LogisticRegression · RidgeClassifier · RidgeClassifierCV · SGDClassifier · PassiveAggressiveClassifier
Tree-based models
DecisionTreeRegressor · DecisionTreeClassifier · RandomForestRegressor · RandomForestClassifier · ExtraTreesRegressor · ExtraTreesClassifier · GradientBoostingRegressor · GradientBoostingClassifier · HistGradientBoostingRegressor · HistGradientBoostingClassifier
Boosting models
XGBoost · LightGBM · CatBoost · AdaBoostRegressor · AdaBoostClassifier
SVM models
SVR · SVC · NuSVR · NuSVC · LinearSVR · LinearSVC
Neighbor models
KNeighborsRegressor · KNeighborsClassifier · NearestCentroid
Naive Bayes models
GaussianNB · BernoulliNB
Discriminant analysis models
LinearDiscriminantAnalysis · QuadraticDiscriminantAnalysis
Ensemble method models
BaggingRegressor · BaggingClassifier
Neural network models
MLPRegressor · MLPClassifier · SimpleNN · WideNN · TabPFN · TabICL · SCARF · DeepMicroAE · DeepMicroVAE · DeepMicroCAE · PhyloFormer
Probabilistic and mixture models
GaussianNB · GMMClassifier · BayesianGaussianMixtureClassifier · FactorAnalysis (unsupervised)
Semi-supervised and AutoML
SelfTraining · FLAMLClassifier · TabPFNClassifier · TabICLClassifier
Prompting
OllamaClassifier
Clustering models
DBSCAN · HDBSCAN · MeanShift · Birch · AgglomerativeClustering · SpectralClustering
Custom models
New models can be added at runtime using the @register_model decorator. Once registered, a custom model is available through ExperimentConfiguration and participates in hyperparameter optimisation alongside built-in learners.
Registering a scikit-learn compatible model
Subclass BaseEstimator and ClassifierMixin (or RegressorMixin), then decorate the class:
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import pairwise_distances

from mllabiome import register_model


@register_model(
    name="WeightedKNN",
    task_types=["binary", "multiclass"],
    params={
        "n_neighbors": {"type": "int", "range": [3, 30], "default": 7},
        "weight_power": {"type": "float", "range": [0.5, 3.0], "default": 1.0},
    },
)
class WeightedKNN(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=7, weight_power=1.0, random_state=42):
        # Store every constructor argument under the same attribute name.
        self.n_neighbors = n_neighbors
        self.weight_power = weight_power
        self.random_state = random_state

    def fit(self, X, y):
        # Convert to arrays so the index lookups below also work for DataFrame/Series input.
        self.X_train_ = np.asarray(X)
        self.y_train_ = np.asarray(y)
        self.classes_ = np.unique(self.y_train_)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        dists = pairwise_distances(X, self.X_train_)
        # Indices of the n_neighbors closest training samples for each row of X.
        idx = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        neighbor_dists = np.take_along_axis(dists, idx, axis=1)
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = 1.0 / (neighbor_dists**self.weight_power + 1e-10)
        proba = np.zeros((len(X), len(self.classes_)))
        for c_idx, c in enumerate(self.classes_):
            mask = self.y_train_[idx] == c
            proba[:, c_idx] = (weights * mask).sum(axis=1)
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        # Take the class with the largest weighted vote and map the column
        # index back to the original label, so non-integer labels also work.
        return self.classes_[self.predict_proba(X).argmax(axis=1)]
```
Key requirements:
- Store every constructor parameter as an attribute with the same name (scikit-learn convention).
- Set `self.classes_` during `fit`.
- Return `self` from `fit`.
- Implement `predict_proba` for classifiers used in ensemble voting or calibration.
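These requirements can be checked with scikit-learn alone before a model is registered. The sketch below uses a deliberately trivial estimator (`PriorClassifier`, a hypothetical name introduced only for illustration) and relies on `sklearn.base.clone`, which rebuilds an estimator from `get_params()` and therefore only works when constructor arguments and attributes share names.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class PriorClassifier(BaseEstimator, ClassifierMixin):
    """Toy classifier that predicts from smoothed class frequencies only."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # same name as the constructor argument

    def fit(self, X, y):
        self.classes_, counts = np.unique(y, return_counts=True)
        smoothed = counts + self.smoothing
        self.priors_ = smoothed / smoothed.sum()
        return self  # returning self allows model.fit(X, y).predict(X)

    def predict_proba(self, X):
        # Every sample receives the same class-prior probabilities.
        return np.tile(self.priors_, (len(X), 1))

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


model = PriorClassifier(smoothing=0.5)
copy = clone(model)  # unfitted copy built from get_params()

X = np.zeros((6, 2))
y = np.array(["a", "a", "a", "a", "b", "b"])
fitted = model.fit(X, y)
proba = model.predict_proba(X)
labels = model.predict(X)
```

If `__init__` stored the value under a different attribute name, `get_params()` would fail to find it and cloning would break, which is typically how this convention surfaces during cross-validation or hyperparameter search.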
Decorator parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Unique identifier used in experiment configs. |
| `task_types` | `list[str]` | all | Supported tasks: `"binary"`, `"multiclass"`, `"regression"`, `"multilabel"`. |
| `params` | `dict` | `{}` | Hyperparameter search space for optimisation. |
| `framework` | `str` | `"sklearn"` | `"sklearn"`, `"pytorch"`, or `"xgboost"`. |
| `constraints` | `list[dict]` | `[]` | Parameter constraints (see below). |
| `description` | `str` | `""` | Human-readable summary. |
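
To make the role of these parameters concrete, here is a minimal sketch of how a registration decorator of this kind could be written. It is not mllabiome's implementation: `toy_register_model`, `ToyModelConfig`, `TOY_REGISTRY` and `toy_classification_models` are hypothetical names, and the duplicate-name check is an assumption about what "unique identifier" implies.

```python
from dataclasses import dataclass, field

ALL_TASKS = ("binary", "multiclass", "regression", "multilabel")


@dataclass
class ToyModelConfig:
    """Metadata recorded for one registered model (hypothetical structure)."""

    name: str
    cls: type
    task_types: tuple
    params: dict = field(default_factory=dict)
    framework: str = "sklearn"
    description: str = ""


TOY_REGISTRY = {}  # maps model name -> ToyModelConfig


def toy_register_model(name, task_types=None, params=None, framework="sklearn", description=""):
    """Return a class decorator that records the class under `name`."""
    tasks = tuple(task_types) if task_types is not None else ALL_TASKS
    unknown = set(tasks) - set(ALL_TASKS)
    if unknown:
        raise ValueError(f"unknown task types: {sorted(unknown)}")

    def decorator(cls):
        if name in TOY_REGISTRY:
            raise ValueError(f"model name already registered: {name!r}")
        TOY_REGISTRY[name] = ToyModelConfig(
            name, cls, tasks, dict(params or {}), framework, description
        )
        return cls  # the class itself is returned unchanged

    return decorator


def toy_classification_models():
    """Names of registered models that support at least one classification task."""
    wanted = {"binary", "multiclass"}
    return sorted(n for n, cfg in TOY_REGISTRY.items() if wanted & set(cfg.task_types))


@toy_register_model(
    name="Dummy",
    task_types=["binary"],
    params={"alpha": {"type": "float", "range": [0.0, 1.0], "default": 0.5}},
)
class Dummy:
    def __init__(self, alpha=0.5):
        self.alpha = alpha
```

Because the decorator returns the class unchanged, the registered class can still be instantiated and tested directly, independently of the registry.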
Hyperparameter types
| Type | Format | Example |
|---|---|---|
| `"int"` | `{"type": "int", "range": [lo, hi], "default": val}` | `{"type": "int", "range": [10, 200], "default": 100}` |
| `"float"` | `{"type": "float", "range": [lo, hi], "default": val}` | `{"type": "float", "range": [0.0, 1.0], "default": 0.5}` |
| `"loguniform"` | `{"type": "loguniform", "range": [lo, hi], "default": val}` | `{"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3}` |
| `"categorical"` | `{"type": "categorical", "choices": [...], "default": val}` | `{"type": "categorical", "choices": ["l1", "l2"], "default": "l2"}` |
| `"boolean"` | `{"type": "boolean", "default": val}` | `{"type": "boolean", "default": True}` |
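
One way to read these specifications is as instructions for drawing a random candidate. The sketch below is an interpretation, not the library's optimiser: `sample_param`, `sample_config` and `default_config` are hypothetical helpers, integer ranges are assumed to be inclusive at both ends, and `"loguniform"` is assumed to mean uniform sampling in log space.

```python
import math
import random


def sample_param(spec, rng):
    """Draw one value for a single hyperparameter specification."""
    kind = spec["type"]
    if kind == "int":
        lo, hi = spec["range"]
        return rng.randint(lo, hi)  # assumption: both bounds inclusive
    if kind == "float":
        lo, hi = spec["range"]
        return rng.uniform(lo, hi)
    if kind == "loguniform":
        lo, hi = spec["range"]
        # Uniform in log space, so each order of magnitude is equally likely.
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))
    if kind == "categorical":
        return rng.choice(spec["choices"])
    if kind == "boolean":
        return rng.choice([True, False])
    raise ValueError(f"unsupported hyperparameter type: {kind!r}")


def sample_config(space, rng):
    """Draw one full configuration from a search space."""
    return {name: sample_param(spec, rng) for name, spec in space.items()}


def default_config(space):
    """Collect the declared default of every hyperparameter."""
    return {name: spec["default"] for name, spec in space.items()}


# Search space built from the example column of the table above.
space = {
    "n_estimators": {"type": "int", "range": [10, 200], "default": 100},
    "subsample": {"type": "float", "range": [0.0, 1.0], "default": 0.5},
    "learning_rate": {"type": "loguniform", "range": [1e-4, 0.1], "default": 1e-3},
    "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
    "bootstrap": {"type": "boolean", "default": True},
}

rng = random.Random(0)
candidate = sample_config(space, rng)
defaults = default_config(space)
```

The `"loguniform"` type is the natural choice for parameters such as learning rates, where values spanning several orders of magnitude should be explored evenly.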
Parameter constraints
Constraints prevent invalid parameter combinations during hyperparameter search:
```python
from sklearn.base import BaseEstimator, ClassifierMixin

from mllabiome import register_model


@register_model(
    name="MySVM",
    task_types=["binary", "multiclass"],
    params={
        "penalty": {"type": "categorical", "choices": ["l1", "l2"], "default": "l2"},
        "solver": {"type": "categorical", "choices": ["lbfgs", "saga"], "default": "lbfgs"},
    },
    constraints=[
        {
            "description": "l1 penalty requires saga solver",
            # Returns True when the combination is valid.
            "check": lambda p: p.get("penalty") != "l1" or p.get("solver") == "saga",
        }
    ],
)
class MySVM(BaseEstimator, ClassifierMixin):
    ...
```
PyTorch models
For deep learning models, extend PyTorchBase and set framework="pytorch":
```python
import torch.nn as nn
from sklearn.base import ClassifierMixin

from mllabiome import register_model
from mllabiome.ai_space.dl_subspace.custom import PyTorchBase


@register_model(
    name="TwoLayerNet",
    task_types=["binary", "multiclass"],
    params={
        "hidden_dim": {"type": "int", "range": [32, 256], "default": 128},
        "learning_rate": {"type": "loguniform", "range": [1e-4, 1e-2], "default": 1e-3},
        "epochs": {"type": "int", "range": [10, 100], "default": 50},
    },
    framework="pytorch",
)
class TwoLayerNet(PyTorchBase, ClassifierMixin):
    def __init__(self, hidden_dim=128, learning_rate=1e-3, epochs=50, random_state=42, **kwargs):
        super().__init__(
            hidden_dim=hidden_dim,
            learning_rate=learning_rate,
            epochs=epochs,
            random_state=random_state,
            **kwargs,
        )

    def _build_model(self, n_features, n_classes):
        # One hidden layer with dropout, followed by a linear output layer.
        return nn.Sequential(
            nn.Linear(n_features, self.hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(self.hidden_dim, n_classes),
        ).to(self.device)
```
Using a custom model in an experiment
Import the registration module before the experiment runs so the decorator executes:
```python
# my_models.py: defines and registers the model (as above)

# experiment.py
import my_models  # triggers registration
import mllabiome as mll

config = mll.ExperimentConfiguration(
    ...
    models=[mll.XGBoost(n_estimators=1000), "WeightedKNN"],
    ...
)
```
Custom and built-in models can be mixed freely in the same experiment.
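The reason the import matters is that a class decorator runs when the module containing it is executed, not before. The self-contained demonstration below uses only the standard library; `toy_registry` and `toy_models` are throwaway modules written to a temporary directory, standing in for the library's registry and for `my_models.py`.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Stand-in for the library's registry module.
REGISTRY_SRC = textwrap.dedent(
    '''
    MODELS = {}

    def register(name):
        def decorator(cls):
            MODELS[name] = cls
            return cls
        return decorator
    '''
)

# Stand-in for my_models.py: registration is a side effect of importing it.
MODELS_SRC = textwrap.dedent(
    '''
    from toy_registry import register

    @register("WeightedKNN")
    class WeightedKNN:
        pass
    '''
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "toy_registry.py").write_text(REGISTRY_SRC)
    Path(tmp, "toy_models.py").write_text(MODELS_SRC)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        toy_registry = importlib.import_module("toy_registry")
        names_before = sorted(toy_registry.MODELS)  # nothing registered yet
        importlib.import_module("toy_models")       # the decorator runs here
        names_after = sorted(toy_registry.MODELS)
    finally:
        # Leave the interpreter state as it was found.
        sys.path.remove(tmp)
        sys.modules.pop("toy_models", None)
        sys.modules.pop("toy_registry", None)

print(names_before, names_after)
```

A string such as `"WeightedKNN"` in an experiment configuration can only be resolved once the module defining it has been imported, which is why the import has to come first.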
Listing registered models
```python
from mllabiome import get_all_models, get_classification_models, get_model_config

print(get_all_models())             # all registered names
print(get_classification_models())  # classification only

config = get_model_config("WeightedKNN")
print(config.params)                # hyperparameter space
```