Experiment configuration

The experiment is defined in a single Python script. For the IBD example this is example/IBD/ibd_franzosa.py. The script constructs an ExperimentConfiguration object and passes it to mll.Evaluator.

Imports

Every experiment script starts with the same two lines:


from pathlib import Path
import mllabiome as mll

Path handles file paths in an OS-independent way. mllabiome is imported as mll by convention. All framework classes are accessed through this alias.

File paths and task definition

The top-level constants declare the input files, output directory, and the column names that link the two files together. Replace all paths and column names here with values that match your own dataset. Nothing else in the script needs to change for a straightforward swap to a different study.


MICROBIOME_FILE_PATH  = Path("example/IBD/data/FRANZOSA_IBD_2019_profiles_hierarchical.tsv")
METADATA_FILE_PATH    = Path("example/IBD/data/metadata.tsv")
EXPERIMENT_DIR        = Path("results/ibd_franzosa")
SAMPLE_ID_COLUMN_NAME = "Sample"
TARGET_COLUMN_NAME    = "Study.Group"
TASK_TYPE             = mll.TaskType.MULTICLASS

Constant	Purpose
`MICROBIOME_FILE_PATH`	Profiles TSV: first column `clade_name` (taxa), remaining columns are sample IDs
`METADATA_FILE_PATH`	Sample metadata TSV with one row per sample
`EXPERIMENT_DIR`	Directory where all results, model weights, and logs are written
`SAMPLE_ID_COLUMN_NAME`	Column in the metadata file that contains sample identifiers. Its values must match the sample column headers of the profiles file
`TARGET_COLUMN_NAME`	Column in the metadata file used as the prediction target
`TASK_TYPE`	Learning task: `mll.TaskType.MULTICLASS`, `mll.TaskType.BINARY`, or `mll.TaskType.REGRESSION`

For the expected file formats and how to prepare your own data, see Data preparation.
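Because the sample identifiers link the two files, it can be worth checking that they agree before starting a long run. The following is a minimal standard-library sketch, not part of mllabiome: the miniature TSV contents, the group labels, and the helper names `profile_sample_ids` and `metadata_sample_ids` are hypothetical and only illustrate the layout described in the table above.

```python
import csv
import io

# Hypothetical miniature files; real data would be read from
# MICROBIOME_FILE_PATH and METADATA_FILE_PATH instead.
PROFILES_TSV = (
    "clade_name\tS1\tS2\tS3\n"
    "k__Bacteria\t10\t20\t30\n"
)
METADATA_TSV = (
    "Sample\tStudy.Group\n"
    "S1\tCD\n"
    "S2\tUC\n"
    "S4\tControl\n"
)


def profile_sample_ids(tsv_text):
    """Return the sample IDs from the profiles header (every column after the first)."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return set(header[1:])


def metadata_sample_ids(tsv_text, id_column):
    """Return the values of the sample-ID column in the metadata table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column] for row in reader}


in_profiles = profile_sample_ids(PROFILES_TSV)
in_metadata = metadata_sample_ids(METADATA_TSV, "Sample")

# Samples present in one file but not in the other.
missing_from_metadata = sorted(in_profiles - in_metadata)
missing_from_profiles = sorted(in_metadata - in_profiles)
print("In profiles only:", missing_from_metadata)
print("In metadata only:", missing_from_profiles)
```

In this toy data, S3 has a profile but no metadata row and S4 has a metadata row but no profile. How the framework itself treats such mismatches is not covered here.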

Evaluation protocol

These two constants define the cross-validation strategy and quality gates. They are shared across all experiment configurations defined later in the script.


NESTED_CV_CONFIG = mll.NestedCVConfig(
    outer_folds=5,
    inner_folds=3,
    repeats=2,
    random_state=42,
    stratify=True,
    stratify_columns=[TARGET_COLUMN_NAME],
)
 
EVALUATION_THRESHOLDS = mll.EvaluationThresholds(
    inner_val_performance_threshold=0.51,
    inner_val_single_fold_performance_threshold=0.51,
)

NestedCVConfig sets up a repeated nested cross-validation loop. In this example, the outer loop runs 5 folds with 2 repeats to estimate generalisation performance, and the inner loop runs 3 folds for model selection. Stratification is enabled to keep class proportions balanced across folds. All of these values are configurable.
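To make the size of this protocol concrete, the sketch below enumerates the splits of a repeated nested cross-validation in plain Python. It is an illustration under stated assumptions, not mllabiome's implementation: the helpers `k_fold_indices` and `nested_cv_splits` are hypothetical, the sample count of 30 is arbitrary, stratification is left out for brevity, and the framework may schedule its fits differently.

```python
import random

OUTER_FOLDS, INNER_FOLDS, REPEATS, SEED = 5, 3, 2, 42


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_folds, inner_folds, repeats, seed):
    """Yield (repeat, outer_fold, inner_fold, train, validation, test) tuples."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        outer = k_fold_indices(range(n_samples), outer_folds, rng)
        for o, test in enumerate(outer):
            # Everything outside the outer test fold is available for model selection.
            development = [i for j, fold in enumerate(outer) if j != o for i in fold]
            inner = k_fold_indices(development, inner_folds, rng)
            for n, validation in enumerate(inner):
                train = [i for m, fold in enumerate(inner) if m != n for i in fold]
                yield repeat, o, n, train, validation, test


splits = list(nested_cv_splits(30, OUTER_FOLDS, INNER_FOLDS, REPEATS, SEED))
outer_evaluations = OUTER_FOLDS * REPEATS
print("outer evaluations:", outer_evaluations)
print("inner fits per configuration:", len(splits))
```

With the values from this example, each configuration faces 10 outer evaluations (5 folds times 2 repeats) and 30 inner train/validation splits (3 per outer fold), which is why the size of the search space has a direct effect on runtime.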

EvaluationThresholds sets minimum performance gates on the inner validation. A configuration is dropped before the outer evaluation if its mean inner-validation score does not exceed inner_val_performance_threshold, or if its score on any single inner fold does not exceed inner_val_single_fold_performance_threshold. Dropped configurations are neither evaluated further nor saved to disk, so raising these values when a strong signal is expected reduces both compute time and output size. Both thresholds can also be adjusted when resuming an interrupted experiment. The value used here (0.51) is intentionally permissive so that almost any configuration with some signal passes the gates.
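The gate logic described above can be sketched in a few lines. This is a reading of the prose, not the framework's code: the function `passes_gates`, the configuration names, and the scores are hypothetical, and treating "exceed" as a strict comparison is an assumption.

```python
from statistics import mean

MEAN_THRESHOLD = 0.51         # inner_val_performance_threshold
SINGLE_FOLD_THRESHOLD = 0.51  # inner_val_single_fold_performance_threshold


def passes_gates(fold_scores, mean_threshold, single_fold_threshold):
    """Return True when both the mean and every single fold exceed their threshold."""
    # Assumption: "exceed" is read as a strict comparison (>).
    return (
        mean(fold_scores) > mean_threshold
        and all(score > single_fold_threshold for score in fold_scores)
    )


# Hypothetical inner-validation scores for three configurations (3 inner folds each).
candidates = {
    "config_a": [0.72, 0.68, 0.70],  # clears both gates
    "config_b": [0.80, 0.75, 0.49],  # good mean, but one fold is too low
    "config_c": [0.50, 0.52, 0.50],  # mean below the gate
}

kept = [
    name
    for name, scores in candidates.items()
    if passes_gates(scores, MEAN_THRESHOLD, SINGLE_FOLD_THRESHOLD)
]
print(kept)
```

Only config_a survives: config_b is rejected by the single-fold gate despite its high mean, which shows why the two thresholds are worth setting separately.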

Configuration search space

The four lists below (taxonomic resolutions, transformations, projections, and base learners) define the search space. Every combination is evaluated independently. Each list can contain one entry or many, in any order, and any entry can be removed or swapped without affecting the rest of the script. The evaluation protocol and thresholds defined above apply uniformly to all combinations.
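As a rough guide to how quickly the search space grows, the sketch below counts the combinations for the IBD example with `itertools.product`. The string labels are stand-ins for the real configuration objects, and counting the unused projection step as a single "no projection" option is an assumption about how the framework treats it.

```python
from itertools import product

# Plain-string stand-ins for the configuration objects in the IBD example.
taxonomic = ["order-species", "phylum-order", "genus", "genus-no-chloroplast"]
transforms = ["none", "binary", "arcsin_sqrt"]
projections = ["no-projection"]  # assumption: an unused step counts as one option
models = ["BernoulliNB", "XGBoost", "RandomForestClassifier"]

# Every combination of one entry from each list is a separate configuration.
combinations = list(product(taxonomic, transforms, projections, models))
print(len(combinations))
print(combinations[0])
```

Four taxonomic configurations, three transformations, and three models give 36 configurations, so adding one more entry to any list multiplies the total rather than adding to it.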

Taxonomic resolution

taxonomic_configs accepts a list of TaxonomicProcessingConfig objects. Each entry defines a different taxonomic slice of the profiles file to evaluate. The IBD example uses four specific configurations chosen to compare ranges and single levels relevant to this dataset. They are not an exhaustive search. For the full set of available levels, the aggregate parameter, and a ready-made collection of predefined configurations, see Taxonomic levels.


TAXONOMIC_RESOLUTIONS_CONFIGS = [
    mll.TaxonomicProcessingConfig.filter_range(
        start_level=mll.TaxonomicLevel.ORDER,
        end_level=mll.TaxonomicLevel.SPECIES,
        aggregate=False,
    ),
    mll.TaxonomicProcessingConfig.filter_range(
        start_level=mll.TaxonomicLevel.PHYLUM,
        end_level=mll.TaxonomicLevel.ORDER,
        aggregate=False,
    ),
    mll.TaxonomicProcessingConfig.filter_exact(
        level=mll.TaxonomicLevel.GENUS,
    ),
    mll.TaxonomicProcessingConfig.filter_exact(
        level=mll.TaxonomicLevel.GENUS, exclude_patterns=["Chloroplast"]
    ),
]

filter_range keeps all levels between start_level and end_level inclusive. filter_exact keeps only a single level and optionally drops clades whose name matches any of the exclude_patterns strings.
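The difference between the two filters can be illustrated on a handful of clade names. The sketch below is a plain-Python approximation, not the mllabiome implementation: it assumes MetaPhlAn-style lineage strings with rank prefixes (`k__`, `p__`, and so on), treats exclude_patterns as substring matches, ignores the aggregate parameter, and uses made-up clade names.

```python
# Assumption: MetaPhlAn-style prefixes mark the rank of each lineage segment.
LEVELS = ["k", "p", "c", "o", "f", "g", "s"]  # kingdom ... species

# Hypothetical clade names, one per row of a profiles file.
CLADES = [
    "k__Bacteria",
    "k__Bacteria|p__Firmicutes",
    "k__Bacteria|p__Firmicutes|c__Clostridia",
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales",
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__F1|g__G1",
    "k__Bacteria|p__Cyanobacteria|c__Chloroplast|o__O2|f__F2|g__G2",
]


def clade_level(clade_name):
    """Return the rank prefix of the deepest segment, e.g. 'g' for a genus row."""
    return clade_name.split("|")[-1].split("__", 1)[0]


def filter_range(clades, start, end):
    """Keep clades whose rank lies between start and end, inclusive."""
    lo, hi = LEVELS.index(start), LEVELS.index(end)
    return [c for c in clades if lo <= LEVELS.index(clade_level(c)) <= hi]


def filter_exact(clades, level, exclude_patterns=()):
    """Keep clades at exactly one rank, dropping names that contain an excluded pattern."""
    return [
        c
        for c in clades
        if clade_level(c) == level
        and not any(pattern in c for pattern in exclude_patterns)
    ]


phylum_to_order = filter_range(CLADES, "p", "o")
genus_all = filter_exact(CLADES, "g")
genus_clean = filter_exact(CLADES, "g", exclude_patterns=["Chloroplast"])
print(len(phylum_to_order), len(genus_all), len(genus_clean))
```

On this toy list the phylum-to-order range keeps three rows, the genus filter keeps two, and excluding "Chloroplast" leaves one.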

Transformation

transform_configs specifies how the count data is preprocessed before model training. This experiment compares three transformations:


TRANSFORMS_CONFIGS = [
    mll.TransformationConfig(
        transform_type=mll.TransformationType.NONE,
        normalize_to_relative=False,
    ),
    mll.TransformationConfig(
        transform_type=mll.TransformationType.BINARY,
        normalize_to_relative=False,
    ),
    mll.TransformationConfig(
        transform_type=mll.TransformationType.ARCSIN_SQRT,
        normalize_to_relative=True,
    ),
]

NONE passes counts through without modification. BINARY converts all non-zero values to 1, encoding presence/absence. ARCSIN_SQRT with normalize_to_relative=True first converts counts to relative abundances, then applies an arc-sine square-root transformation, a standard variance-stabilising step for compositional data. For all available types and their descriptions, see Transformation types.
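The three transformations are simple enough to reproduce by hand on one sample. The sketch below uses only the standard library and illustrates the arithmetic described above. It is not the framework's code: the helper names and the count vector are hypothetical.

```python
import math


def to_relative(counts):
    """Scale one sample's counts so they sum to 1 (all-zero samples stay zero)."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def binary(counts):
    """Presence/absence: every non-zero value becomes 1."""
    return [1 if c != 0 else 0 for c in counts]


def arcsin_sqrt(proportions):
    """Arc-sine square-root transform of values in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


sample = [0, 1, 3]  # hypothetical counts for three taxa in one sample

print(sample)                            # NONE: unchanged
print(binary(sample))                    # BINARY: presence/absence
print(arcsin_sqrt(to_relative(sample)))  # ARCSIN_SQRT after normalisation
```

For this sample the relative abundances are 0, 0.25, and 0.75, and the transformed values are 0, π/6, and π/3 up to floating-point rounding. The transform stretches small proportions more than large ones, which is the source of its variance-stabilising effect.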

Projection

projection_configs accepts a list of ProjectionConfig objects and adds a dimensionality reduction step after transformation. This experiment does not use projection. See Projection methods for available options.

Target

target_configs defines the prediction task. Here it is a three-class classification on the Study.Group column:


target_configs=[
    mll.TargetConfig(
        column=TARGET_COLUMN_NAME,
        task_type=TASK_TYPE,
    )
],

Set the TARGET_COLUMN_NAME constant to the name of the target column in your metadata file, and set TASK_TYPE to the matching mll.TaskType value. Supported values are mll.TaskType.MULTICLASS, mll.TaskType.BINARY, and mll.TaskType.REGRESSION.

Models

primary_modality_models lists the base learners to evaluate. This experiment compares three:


MODEL_CONFIGS = [
    mll.BernoulliNB(),
    mll.XGBoost(n_estimators=1000),
    mll.RandomForestClassifier(n_estimators=1000, min_samples_leaf=5, random_state=91),
]

BernoulliNB is a Naive Bayes classifier suited to binary (presence/absence) features. XGBoost and RandomForestClassifier are gradient-boosted and bagged tree ensembles respectively. Constructor arguments are passed directly as model hyperparameters. For the full list of available base learners organised by family, see Microbiome profile modelling algorithm (MPMA).

Hyperparameter optimisation

hyperopt_config controls whether hyperparameter search is run inside the inner loop. It is disabled in this example to keep runtime short:


hyperopt_config=mll.HyperoptConfig(
    enabled=False,
    ...
),

Set enabled=True to activate Optuna-based search with the configured sampler and pruner.