Transformation types
A TransformationConfig specifies how raw abundance values are transformed before they are passed to a model. The transform_type field selects the mathematical operation applied to each sample, and normalize_to_relative controls whether counts are converted to relative abundances before that operation.
```python
import mllabiome as mll

mll.TransformationConfig(
    transform_type=mll.TransformationType.ARCSIN_SQRT,
    normalize_to_relative=True,
)
```
Available transformations
Identity and basic
| Type | Description |
|---|---|
| NONE | No transformation. Raw counts or relative abundances are used as-is. |
| TSS | Total sum scaling. Each sample is divided by its total, producing relative abundances. |
| BINARY | Presence/absence encoding. All non-zero values are set to 1. |
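
As a rough illustration of the two non-trivial entries above, the following sketch applies total sum scaling and presence/absence encoding to a single sample of counts. It uses plain Python lists; the function names `tss` and `binary` are illustrative and are not part of mllabiome. How the library handles an all-zero sample is not stated here, so returning zeros in that case is an assumption.

```python
def tss(sample):
    """Total sum scaling: divide each value by the sample total."""
    total = sum(sample)
    if total == 0:
        # Assumption: an all-zero sample maps to all zeros.
        return [0.0 for _ in sample]
    return [value / total for value in sample]


def binary(sample):
    """Presence/absence encoding: every non-zero value becomes 1."""
    return [1 if value != 0 else 0 for value in sample]


counts = [10, 0, 30, 60]
print(tss(counts))     # [0.1, 0.0, 0.3, 0.6]
print(binary(counts))  # [1, 0, 1, 1]
```
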
Log-ratio transformations (compositional data)
These transformations are designed for compositional data where absolute values carry no meaning and only ratios between components are informative.
| Type | Description |
|---|---|
| CLR | Centred log-ratio. Each value is divided by the geometric mean of the sample, then log-transformed. |
| ALR | Additive log-ratio. Each value is expressed as a log-ratio relative to a reference component. |
| ILR | Isometric log-ratio. Projects the composition into an unconstrained Euclidean space via an orthonormal basis. |
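
The CLR and ALR definitions above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the library's implementation: the function names are hypothetical, the choice of the last component as ALR reference is an assumption, and zeros are assumed to have been replaced by a pseudocount beforehand. ILR is omitted because it depends on the choice of orthonormal basis, which is not specified here.

```python
import math


def clr(composition):
    """Centred log-ratio: log of each part minus the mean of the logs.

    Subtracting the mean log is the same as dividing by the geometric
    mean before taking the logarithm.
    """
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def alr(composition, reference=-1):
    """Additive log-ratio relative to one reference part (assumed: the last)."""
    ref_index = reference % len(composition)
    ref_log = math.log(composition[ref_index])
    return [
        math.log(part) - ref_log
        for i, part in enumerate(composition)
        if i != ref_index
    ]


composition = [0.2, 0.3, 0.5]  # strictly positive parts
clr_values = clr(composition)
alr_values = alr(composition)

print(clr_values)  # three values that sum to approximately zero
print(alr_values)  # two values: log(0.2 / 0.5) and log(0.3 / 0.5)
```

Because both transformations depend only on ratios, rescaling the input (for example passing `[2, 3, 5]` instead of `[0.2, 0.3, 0.5]`) leaves the output unchanged up to floating-point error.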
Root and trigonometric transformations
| Type | Description |
|---|---|
| SQRT | Square root of each value. |
| HELLINGER | Square root of relative abundances. Equivalent to the Hellinger transformation used in ordination. |
| ARCSIN | Arc-sine of each value. |
| ARCSIN_SQRT | Arc-sine of the square root of each value. A variance-stabilising transformation for proportions. |
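
The sketch below works through HELLINGER and ARCSIN_SQRT on one sample, using only the standard library. The function names are illustrative and not part of mllabiome. It also shows why the arc-sine variants are naturally paired with relative abundances: `math.asin` is only defined for inputs in [-1, 1], so raw counts above 1 would raise an error.

```python
import math


def hellinger(sample):
    """Square root of relative abundances."""
    total = sum(sample)
    return [math.sqrt(value / total) for value in sample]


def arcsin_sqrt(proportions):
    """Arc-sine of the square root; inputs must lie in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


counts = [1, 4, 20, 75]
proportions = [c / sum(counts) for c in counts]

print(hellinger(counts))         # squared values sum to 1
print(arcsin_sqrt(proportions))  # values lie between 0 and pi / 2
```
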
Log transformations
| Type | Description |
|---|---|
| LOG | Natural logarithm. A small pseudocount is added to handle zeros. |
| LOG10 | Base-10 logarithm with pseudocount. |
| LOG2 | Base-2 logarithm with pseudocount. |
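
A minimal sketch of a pseudocount log transform follows. The pseudocount value `1e-6` and the function name are assumptions made for illustration; the value mllabiome actually uses is not stated in this section.

```python
import math

PSEUDOCOUNT = 1e-6  # assumed value, chosen only for this example


def log_transform(sample, base=math.e, pseudocount=PSEUDOCOUNT):
    """Logarithm of (value + pseudocount) in the requested base."""
    return [math.log(value + pseudocount, base) for value in sample]


sample = [0.0, 0.001, 0.1]
print(log_transform(sample))           # natural log, finite even for the zero
print(log_transform(sample, base=10))  # base 10
print(log_transform(sample, base=2))   # base 2
```

The three variants differ only by a constant factor, so the choice of base changes the scale of the features but not their ordering.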
Rank-based normalisation
These transformations were identified as strong performers in benchmarking studies of microbiome classification pipelines.
| Type | Description |
|---|---|
| RANK_STD | Feature ranks followed by z-score standardisation. |
| RANK_UNIT | Feature ranks divided by the square root of the sum of squared ranks. |
| LOG_STD | Natural log followed by z-score standardisation. |
| LOG_UNIT | Natural log followed by unit-norm scaling. |
| ZSCORE | Z-score standardisation of raw values without prior rank or log step. |
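
The two-step structure shared by these transformations (a rank or log step, then standardisation or unit-norm scaling) can be sketched as below. Several details are assumptions made for the example rather than documented behaviour: ranks are computed across the features of one sample, ties receive their average rank, and the z-score uses the population standard deviation. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def unit_norm(values):
    """Divide by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


sample = [0, 5, 5, 120]
ranks = average_ranks(sample)
print(ranks)             # [1.0, 2.5, 2.5, 4.0]
print(zscore(ranks))     # RANK_STD-style output: mean 0, standard deviation 1
print(unit_norm(ranks))  # RANK_UNIT-style output: squared values sum to 1
```

Note that `zscore` as written divides by zero when all values are identical; a real implementation would need to guard against that case.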
Variance-stabilising transformations
These transformations use batch-level statistics fitted on training data. They are applied after fitting a scaler across the training set and are not purely per-sample operations.
| Type | Description |
|---|---|
| POWER | Yeo-Johnson power transformation. Stabilises variance and reduces skewness. |
| BOXCOX | Box-Cox transformation. Requires strictly positive values. |
| ROBUST | Robust scaling using median and interquartile range. Less sensitive to outliers than z-score. |
| QUANTILE | Quantile transformation mapping values to a uniform or normal distribution. |
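
What "fitted on training data" means in practice can be shown with robust scaling of a single feature: the median and interquartile range are learned from the training values only and then reused, unchanged, on the test values. This is a sketch under stated assumptions, not the library's code. The helper names are hypothetical, and the inclusive quantile method is one of several reasonable choices.

```python
import statistics


def fit_robust(train_column):
    """Learn median and interquartile range from training values only."""
    q1, _, q3 = statistics.quantiles(train_column, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: fall back to 1.0 so constant features do not divide by zero.
    return statistics.median(train_column), (iqr if iqr != 0 else 1.0)


def apply_robust(column, median, iqr):
    """Scale values with parameters that were fitted elsewhere."""
    return [(value - median) / iqr for value in column]


train = [1.0, 2.0, 3.0, 4.0, 100.0]  # contains one outlier
test = [2.5, 50.0]

median, iqr = fit_robust(train)
print(median, iqr)                       # 3.0 2.0
print(apply_robust(train, median, iqr))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(apply_robust(test, median, iqr))   # [-0.25, 23.5]
```

Fitting on the training set alone keeps information from the test set out of the learned parameters, which is why these transformations cannot be applied one sample at a time.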
Special-purpose transformations
| Type | Description |
|---|---|
| CHI_SQUARE | Chi-square scaling, transforming each feature by its chi-square statistic relative to the sample total. |
| MGM_ENCODING | Neural encoder output from a pre-trained MGM model. Requires a compatible model checkpoint and is not interchangeable with the transformations above. Requires the mgm optional dependency; install with uv pip install -e ".[mgm]" (see Optional extras). |
Predefined configuration set
operations/transforms_configs.py provides a ready-to-use list covering 23 transformation types, each evaluated both with and without prior relative-abundance normalisation. MGM_ENCODING is excluded as it requires a separate model checkpoint.
```python
import mllabiome as mll
from operations.transforms_configs import transforms_configs

config = mll.ExperimentConfiguration(
    ...  # other arguments elided
    transformation_configs=transforms_configs,
    ...
)
```
Custom transformations
New transformations can be added at runtime using the @register_transformation decorator. Once registered, a custom transformation is available through TransformationConfig and participates in the experiment sweep alongside built-in types.
Row-wise transformations
A row-wise transformation processes each sample independently. Subclass BaseTransformer and implement transform:
```python
import numpy as np
import pandas as pd

from mllabiome import register_transformation
from mllabiome.data_space.processing.compositional_transformation import BaseTransformer


@register_transformation(
    name="robust_clr",
    description="CLR with median centering instead of mean",
)
class RobustCLRTransformer(BaseTransformer):
    def transform(self, sample: pd.DataFrame) -> pd.DataFrame:
        # Flatten the single-row frame to a 1-D array of feature values.
        values = sample.values.flatten()
        # Replace zeros with a small constant so the logarithm is defined.
        values = np.where(values == 0, 1e-10, values)
        log_values = np.log(values)
        # Centre on the median of the log values rather than their mean.
        centered = log_values - np.median(log_values)
        return pd.DataFrame(
            centered.reshape(1, -1),
            index=sample.index,
            columns=sample.columns,
        )
```
The sample argument is a single-row DataFrame (1 x n_features). The returned DataFrame must preserve the same index and columns.
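
To see what median centring changes relative to standard CLR, the following standard-library sketch compares the two on a sample with one dominant feature. It mirrors the arithmetic of the transformer above but is a separate, simplified illustration: it works on a plain list of strictly positive values and skips the zero replacement step.

```python
import math
import statistics


def centred_logs(values, centre):
    """Log-transform values, then subtract a centre computed from the logs."""
    logs = [math.log(v) for v in values]
    offset = centre(logs)
    return [value - offset for value in logs]


# One dominant feature among otherwise identical abundances.
sample = [1.0, 1.0, 1.0, 1.0, 10000.0]

mean_centred = centred_logs(sample, statistics.fmean)     # standard CLR
median_centred = centred_logs(sample, statistics.median)  # median centring

print(mean_centred[0])    # below zero: the dominant feature shifts the centre
print(median_centred[0])  # 0.0: the median is unaffected by the single outlier
```
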
Batch transformations
A batch transformation fits parameters on training data and applies the learned transformation to both train and test sets. Subclass BatchTransformer and implement fit_transform_batch:
```python
import numpy as np
import pandas as pd

from mllabiome import register_transformation
from mllabiome.data_space.processing.compositional_transformation import BatchTransformer


@register_transformation(
    name="percentile_norm",
    description="Percentile-based normalization fitted on training data",
    is_batch=True,
)
class PercentileNormTransformer(BatchTransformer):
    def __init__(self, lower: float = 5, upper: float = 95):
        self.lower = lower
        self.upper = upper

    def fit_transform_batch(
        self,
        X_train: pd.DataFrame,
        X_test: pd.DataFrame,
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        # Per-feature percentiles are computed from the training set only.
        p_low = np.percentile(X_train.values, self.lower, axis=0)
        p_high = np.percentile(X_train.values, self.upper, axis=0)
        range_ = p_high - p_low
        # Avoid division by zero for features with no spread.
        range_[range_ == 0] = 1.0
        # The same fitted parameters are applied to both sets.
        X_train_norm = (X_train - p_low) / range_
        X_test_norm = (X_test - p_low) / range_
        return X_train_norm.clip(0, 1), X_test_norm.clip(0, 1)
```
Decorator parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Unique identifier used in TransformationConfig. |
| description | str | "" | Human-readable summary. |
| is_batch | bool | False | Whether the transform requires fitting on training data. |
| formula | str | "" | Optional LaTeX formula for documentation. |
| requires_positive | bool | False | Whether input values must be positive. |
| changes_dimensions | bool | False | Whether the output has a different number of features than the input. |
Using a custom transformation in an experiment
The registration module must be imported before the experiment runs so the decorator executes:
```python
# my_transforms.py: defines and registers the transformation (as above)

# experiment.py
import my_transforms  # triggers registration
import mllabiome as mll

config = mll.ExperimentConfiguration(
    ...
    transformation_configs=[
        mll.TransformationConfig(transform_type="robust_clr", normalize_to_relative=True),
        mll.TransformationConfig(transform_type=mll.TransformationType.CLR),
    ],
    ...
)
```
Custom and built-in transformations can be mixed freely in the same experiment.
Listing registered transformations
```python
from mllabiome import (
    get_all_transformations,
    get_batch_transformations,
    get_row_transformations,
)

print(get_all_transformations())    # all registered names
print(get_row_transformations())    # row-wise only
print(get_batch_transformations())  # batch only
```