Experiment configuration
The experiment is defined in example/IBD/ibd_franzosa_mtb.py. It follows the same structure as the microbiome tutorial but with three configuration changes specific to generic tabular data.
Imports
from pathlib import Path
import mllabiome as mll
File paths and task definition
METABOLOMICS_FILE_PATH = Path("example/IBD/data/metabolomics_data.tsv")
METADATA_FILE_PATH = Path("example/IBD/data/metadata.tsv")
EXPERIMENT_DIR = Path("results/ibd_metabolomics")
SAMPLE_ID_COLUMN_NAME = "Sample"
TARGET_COLUMN_NAME = "Study.Group"
TASK_TYPE = mll.TaskType.MULTICLASS

| Constant | Purpose |
|---|---|
| METABOLOMICS_FILE_PATH | Feature TSV: first column Sample (sample IDs), remaining columns are metabolite abundances |
| METADATA_FILE_PATH | Sample metadata TSV with one row per sample |
| EXPERIMENT_DIR | Directory where all results and artefacts are written |
| SAMPLE_ID_COLUMN_NAME | Column in both the features file and the metadata file that identifies samples |
| TARGET_COLUMN_NAME | Column in the metadata file used as the prediction target |
| TASK_TYPE | Learning task: mll.TaskType.MULTICLASS, mll.TaskType.BINARY, or mll.TaskType.REGRESSION |
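Before launching the experiment, it can help to confirm that the features file and the metadata file agree on sample IDs. The following is a minimal standard-library sketch and not part of mllabiome: the helper name `read_sample_ids` is hypothetical, and the two in-memory tables are toy stand-ins for the real TSV files.

```python
import csv
import io

SAMPLE_ID_COLUMN_NAME = "Sample"

# Toy stand-ins for the two TSV files (illustrative values only).
features_tsv = (
    "Sample\tmetabolite_a\tmetabolite_b\n"
    "S1\t0.5\t1.2\n"
    "S2\t0.1\t3.4\n"
    "S3\t2.0\t0.0\n"
)
metadata_tsv = (
    "Sample\tStudy.Group\n"
    "S1\tCD\n"
    "S2\tUC\n"
    "S4\tControl\n"
)


def read_sample_ids(handle, id_column):
    """Return the set of sample IDs found in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if id_column not in (reader.fieldnames or []):
        raise KeyError(f"column {id_column!r} not found")
    return {row[id_column] for row in reader}


feature_ids = read_sample_ids(io.StringIO(features_tsv), SAMPLE_ID_COLUMN_NAME)
metadata_ids = read_sample_ids(io.StringIO(metadata_tsv), SAMPLE_ID_COLUMN_NAME)

# Samples present in both files, and samples missing from one side.
shared = sorted(feature_ids & metadata_ids)
only_features = sorted(feature_ids - metadata_ids)
only_metadata = sorted(metadata_ids - feature_ids)

print("shared:", shared)
print("missing from metadata:", only_features)
print("missing from features:", only_metadata)
```

With real data, replace the `io.StringIO` objects with open file handles on METABOLOMICS_FILE_PATH and METADATA_FILE_PATH.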
Evaluation protocol
NESTED_CV_CONFIG = mll.NestedCVConfig(
    outer_folds=5,
    inner_folds=3,
    repeats=2,
    random_state=42,
    stratify=True,
    stratify_columns=[TARGET_COLUMN_NAME],
)
EVALUATION_THRESHOLDS = mll.EvaluationThresholds(
    inner_val_performance_threshold=0.51,
    inner_val_single_fold_performance_threshold=0.51,
)
These settings are identical to the microbiome tutorial. See Evaluation protocol for a description.
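To get a feel for how much work these settings imply, the sketch below enumerates the train/validation splits. It assumes the conventional reading of repeated nested cross-validation, in which `repeats` re-runs the whole outer loop and every outer fold contains its own inner loop. That reading is an assumption, and the library's actual scheduling may differ.

```python
from itertools import product

# Values copied from NESTED_CV_CONFIG above.
OUTER_FOLDS = 5
INNER_FOLDS = 3
REPEATS = 2

# One outer split per (repeat, outer fold) pair.
outer_splits = list(product(range(REPEATS), range(OUTER_FOLDS)))

# One inner split per (repeat, outer fold, inner fold) triple.
inner_splits = list(
    product(range(REPEATS), range(OUTER_FOLDS), range(INNER_FOLDS))
)

print("outer splits:", len(outer_splits))
print("inner splits:", len(inner_splits))
```

Under this assumption, each candidate pipeline is fitted once per inner split, so the total training cost grows with the product of the three settings.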
Transformation
The metabolomics data in this example has no taxonomic hierarchy to filter, so taxonomic_configs is set to an empty list. To keep the experiment fast, a single pass-through transformation is used:
TRANSFORMS_CONFIGS = [
    mll.TransformationConfig(
        transform_type=mll.TransformationType.NONE,
        normalize_to_relative=False,
    ),
]
For all available transformation types, see Transformation types.
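The difference between a pass-through and a relative normalization can be illustrated with plain Python. The sketch assumes that `normalize_to_relative=True` would scale each sample so that its values sum to 1 (total-sum scaling). This page does not confirm that meaning, and the helper `to_relative` is hypothetical.

```python
def to_relative(sample_values):
    """Scale one sample's values so they sum to 1 (assumed meaning)."""
    total = sum(sample_values)
    if total == 0:
        # Avoid division by zero; leave an all-zero sample unchanged.
        return list(sample_values)
    return [value / total for value in sample_values]


# Toy abundances for a single sample (illustrative values only).
sample = [2.0, 6.0, 12.0]

passthrough = list(sample)      # what a pass-through keeps: the raw values
relative = to_relative(sample)  # what relative scaling would produce

print("pass-through:", passthrough)
print("relative:", relative)
```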
Models
MODEL_CONFIGS = [
    mll.XGBoost(n_estimators=100),
    mll.RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=91),
    mll.LogisticRegression(max_iter=100),
]
For the full list of supported models, see Available base learners.
Assembling the configuration
The three differences from the microbiome tutorial are highlighted below:
config = mll.ExperimentConfiguration(
    primary_data_file=METABOLOMICS_FILE_PATH,  # not microbiome_file
    metadata_file=METADATA_FILE_PATH,
    experiment_dir=EXPERIMENT_DIR,
    sample_id_column=SAMPLE_ID_COLUMN_NAME,
    features_are_rows=False,  # samples are rows, features are columns
    taxonomic_configs=[],  # no taxonomy
    transform_configs=TRANSFORMS_CONFIGS,
    target_configs=[
        mll.TargetConfig(
            column=TARGET_COLUMN_NAME,
            task_type=TASK_TYPE,
        )
    ],
    primary_modality_models=MODEL_CONFIGS,
    ...
)

| Parameter | Microbiome tutorial | This tutorial |
|---|---|---|
| microbiome_file / primary_data_file | microbiome_file=... | primary_data_file=... |
| features_are_rows | True (taxa are rows) | False (samples are rows) |
| taxonomic_configs | list of TaxonomicProcessingConfig | [] |
primary_data_file is the parameter for any non-microbiome primary modality. microbiome_file is the legacy name for the same concept and remains accepted for backward compatibility.
features_are_rows=False tells the loader that the file has one sample per row. The data is transposed internally before any transformation or model training, so the rest of the pipeline sees the standard features-as-rows layout.
taxonomic_configs=[] disables the taxonomic filtering step entirely. With no taxonomy to traverse, the transformation search runs directly on the full feature matrix.
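The layout change implied by features_are_rows=False can be pictured with a small transposition. This is only an illustration of the two layouts with toy values. How the loader performs the transposition internally is not shown on this page.

```python
# Samples-as-rows layout, as in the metabolomics TSV (toy values).
samples_as_rows = [
    ["Sample", "metabolite_a", "metabolite_b"],
    ["S1", "0.5", "1.2"],
    ["S2", "0.1", "3.4"],
]

# Transpose: each original column becomes a row, so every feature gets
# its own row and every sample its own column.
features_as_rows = [list(column) for column in zip(*samples_as_rows)]

for row in features_as_rows:
    print("\t".join(row))
```

After the transposition, the first row lists the sample IDs and each later row holds one metabolite, which matches the features-as-rows layout described above.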