Data preparation

The first two sections cover downloading and converting the example dataset. For custom data, the Input format section gives the file format specification. Following the example first is still recommended, since it makes the expected structure more concrete.

Example data download

The dataset used in this tutorial is from the Borenstein Lab curated microbiome data collection.

The files can be downloaded automatically by running the provided script from the project root:


python example/IBD/data/download_data.py

Alternatively, download each file manually and place it at the path shown.

File	Path
metadata.tsv	`example/IBD/data/metadata.tsv`
species.counts.tsv	`example/IBD/data/species.counts.tsv`
genera.counts.tsv	`example/IBD/data/genera.counts.tsv`


example/IBD/
├── data/
│   ├── download_data.py
│   ├── metadata.tsv
│   ├── species.counts.tsv
│   └── genera.counts.tsv
└── ibd_franzosa.py

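If using your own data, replace these files with your cohort’s sample-level metadata and feature table in the same TSV format. The metadata file must contain a column that serves as the prediction target. At this stage the feature table has samples as rows and features as columns, matching the downloaded count files. The profiles file that mllabiome reads uses the transposed layout described under Input format.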

Profile conversion

In the downloaded dataset, the raw count files store taxonomy as semicolon-delimited strings (e.g. d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus oralis_S) and are split across two files, one per taxonomic resolution. Before running an experiment, these must be merged into a single profile table in which each row is a fully qualified lineage path.

The reason for building a multi-resolution table is that mllabiome can operate at any taxonomic granularity, or across several at once. Stacking species, genus, family, order, class, phylum, and domain rows into one file retains full flexibility to configure the experiment at whichever resolution fits the research question, without reprocessing the data.

mllabiome does not require a hierarchical profile. A table with rows at a single taxonomic level, a subset of levels, or any combination is equally valid. The hierarchical format is a convenient starting point when experimenting across resolutions.

The repository includes a conversion script for this dataset. Run it from the project root:


python example/IBD/data/prepare_hierarchical_profiles.py

The script loads species.counts.tsv and genera.counts.tsv, transposes them so that taxa are rows and samples are columns, aggregates genus counts upward through family, order, class, phylum, and domain, converts all taxonomy strings to triple-underscore-separated lineage paths, and writes the combined result to:


example/IBD/data/FRANZOSA_IBD_2019_profiles_hierarchical.tsv

Each row index looks like: d__Bacteria___p__Firmicutes___c__Bacilli___o__Lactobacillales___f__Streptococcaceae___g__Streptococcus___s__Streptococcus oralis_S

Triple underscores serve as the level separator because several models in the search space, including XGBoost, reject feature names that contain special characters such as brackets or colons. The triple-underscore delimiter is unambiguous and safe across all supported backends.
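The string conversion and upward aggregation described above can be sketched in plain Python. This is a minimal illustration, not the code in prepare_hierarchical_profiles.py: the function names `to_lineage_path` and `aggregate_upward` are hypothetical, the counts are made up, and the sketch assumes every input row sits at the same taxonomic level so that summing over lineage prefixes does not double count.

```python
SEP = "___"  # level separator used in the profiles file


def to_lineage_path(taxonomy: str) -> str:
    """Turn 'd__X;p__Y;...' into 'd__X___p__Y___...'."""
    levels = [level.strip() for level in taxonomy.split(";") if level.strip()]
    return SEP.join(levels)


def aggregate_upward(rows: dict[str, list[float]]) -> dict[str, list[float]]:
    """Return the input rows plus one summed row per ancestor lineage.

    Assumes all input rows are at the same level (e.g. genus).
    """
    if not rows:
        return {}
    n_samples = len(next(iter(rows.values())))
    out: dict[str, list[float]] = {}
    for path, values in rows.items():
        levels = path.split(SEP)
        # Every prefix of the lineage, including the full path itself.
        for depth in range(1, len(levels) + 1):
            prefix = SEP.join(levels[:depth])
            acc = out.setdefault(prefix, [0] * n_samples)
            for i, value in enumerate(values):
                acc[i] += value
    return out


# Illustrative counts for two samples; not taken from the dataset.
family = "d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae"
genus_rows = {
    to_lineage_path(family + ";g__Streptococcus"): [10, 4],
    to_lineage_path(family + ";g__Lactococcus"): [5, 3],
}
table = aggregate_upward(genus_rows)
for name, values in table.items():
    print(name, values)
```

The result holds the two genus rows plus one row for each of the five shared ancestors, each carrying the per-sample sum of its descendants.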

This script is specific to the Franzosa IBD 2019 dataset. When working with different data, an equivalent TSV must be prepared with samples as columns, taxa at the desired resolution as rows, and the same triple-underscore lineage notation. The next section describes the expected input format in detail.

Input format

mllabiome expects two TSV files: a profiles file and a metadata file.

Profiles file

Rows are taxa, columns are samples. The first column is the row index and must be named clade_name. Each value in this column is a triple-underscore-separated lineage path ending at the desired taxonomic level.


clade_name                     PRISM.7122   PRISM.7147   PRISM.7150
d__Bacteria                    22994906     86430469     75450907
d__Bacteria___p__Firmicutes_A  6758877      60132157     21182814
...

Row order is unconstrained. Raw counts, relative abundances, or any other numeric representation are accepted. The only structural requirement is that each taxon name uses triple underscores as the level separator. Preprocessing and normalisation are configured within the experiment, not at the file level.
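Before running an experiment on custom data, it can help to check a profiles file against these rules. The sketch below uses only the standard library and does not call any mllabiome API. The helper names `read_profiles` and `rows_at_rank` are hypothetical, and the rank prefixes (d__, p__, ...) are assumed to follow the convention seen in the example rows.

```python
import csv
import io

SEP = "___"


def read_profiles(handle):
    """Parse a tab-separated profiles file into (sample_ids, {clade_name: values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "clade_name":
        raise ValueError(f"first column must be 'clade_name', got {header[0]!r}")
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(f"{name}: expected {len(samples)} values, got {len(values)}")
        # float() raises ValueError if a cell is not numeric.
        table[name] = [float(v) for v in values]
    return samples, table


def rows_at_rank(table, prefix):
    """Keep rows whose deepest level starts with a rank prefix such as 'p__'."""
    return {
        name: values
        for name, values in table.items()
        if name.split(SEP)[-1].startswith(prefix)
    }


# The two example rows shown above, written with tab separators.
example = (
    "clade_name\tPRISM.7122\tPRISM.7147\tPRISM.7150\n"
    "d__Bacteria\t22994906\t86430469\t75450907\n"
    "d__Bacteria___p__Firmicutes_A\t6758877\t60132157\t21182814\n"
)
samples, table = read_profiles(io.StringIO(example))
phylum_rows = rows_at_rank(table, "p__")
print(samples)
print(list(phylum_rows))
```

To check a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.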

Metadata file

Rows are samples, columns are clinical or experimental variables. The file must contain a sample identifier column whose values match the column headers in the profiles file, and at least one column to use as the prediction target.


Sample      Study.Group   Age
PRISM.7122  CD            38
PRISM.7147  CD            50
...

Column names and column order are not prescribed. The sample identifier column and the target column are specified in the experiment configuration, so any naming convention is acceptable. In this experiment the sample identifier is Sample and the target is Study.Group, which holds three class labels: CD, UC, and nonIBD.
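Because the sample identifiers must match the profile column headers, a quick consistency check between the two files can catch mismatches early. The following is a minimal standard-library sketch. `check_alignment` is a hypothetical helper, not part of mllabiome, and the column names are passed in explicitly, mirroring how the experiment configuration names them.

```python
import csv
import io


def check_alignment(profile_samples, metadata_handle, id_column, target_column):
    """Compare metadata sample IDs with profile column headers.

    Returns (targets, only_in_profiles, only_in_metadata).
    """
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    fields = reader.fieldnames or []
    for column in (id_column, target_column):
        if column not in fields:
            raise ValueError(f"metadata file has no column named {column!r}")
    targets = {row[id_column]: row[target_column] for row in reader}
    only_in_profiles = sorted(set(profile_samples) - set(targets))
    only_in_metadata = sorted(set(targets) - set(profile_samples))
    return targets, only_in_profiles, only_in_metadata


# The metadata excerpt shown above, written with tab separators. It is
# truncated, so PRISM.7150 is absent here only because the excerpt stops early.
metadata = (
    "Sample\tStudy.Group\tAge\n"
    "PRISM.7122\tCD\t38\n"
    "PRISM.7147\tCD\t50\n"
)
profile_samples = ["PRISM.7122", "PRISM.7147", "PRISM.7150"]
targets, only_in_profiles, only_in_metadata = check_alignment(
    profile_samples, io.StringIO(metadata), "Sample", "Study.Group"
)
print(only_in_profiles)
print(only_in_metadata)
```

On complete files, both lists being empty indicates that every sample has a profile column and a target label.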