
Data preparation

Example data download

The dataset is drawn from the Borenstein Lab curated microbiome data collection.

Run the provided download script from the project root:

python example/IBD/data/download_mtb_data.py

This downloads mtb.tsv (metabolite abundances) and metadata.tsv (sample labels) to example/IBD/data/.

File          Path
metadata.tsv  example/IBD/data/metadata.tsv
mtb.tsv       example/IBD/data/mtb.tsv

Data preparation

The raw mtb.tsv file uses compound cluster IDs as column names (e.g. C18-neg_Cluster_0001: NA). The preparation step renames these to a consistent met_XXXX format, producing feature names that are safe and predictable across tools.

Run the preparation script from the project root:

python example/IBD/data/prepare_mtb_data.py

The script reads mtb.tsv, renames the compound columns to met_0000 through met_8847, and writes the result to example/IBD/data/metabolomics_data.tsv.
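The renaming step can be pictured with the short sketch below. It is not the code of prepare_mtb_data.py; it only illustrates the met_XXXX scheme described above. The function name rename_feature_columns, the assumption that the identifier column is called Sample, and the second cluster name in the demo header are all made up for illustration.

```python
def rename_feature_columns(header, id_column="Sample", prefix="met_", width=4):
    """Rename every non-identifier column to met_0000, met_0001, ...

    Returns the new header plus a mapping from new name to original name,
    so the original compound cluster IDs can still be looked up later.
    """
    new_header = []
    mapping = {}
    counter = 0
    for name in header:
        if name == id_column:
            # Keep the sample identifier column unchanged.
            new_header.append(name)
            continue
        new_name = f"{prefix}{counter:0{width}d}"
        mapping[new_name] = name
        new_header.append(new_name)
        counter += 1
    return new_header, mapping


# Illustrative header: the first cluster name follows the example in the text,
# the second one is a made-up placeholder.
header = ["Sample", "C18-neg_Cluster_0001: NA", "C18-neg_Cluster_0002: NA"]
new_header, mapping = rename_feature_columns(header)
print(new_header)
print(mapping["met_0000"])
```

Keeping the mapping alongside the renamed header is a design choice of this sketch: it lets a feature such as met_0000 be traced back to its original cluster ID when interpreting results.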

Input format

mllabiome expects two TSV files: a features file and a metadata file.

Features file

For generic (non-microbiome) tabular data, samples are rows and features are columns. The first column must contain sample identifiers. The file must have a header row.

Sample      met_0000  met_0001  met_0002
PRISM.7122  158.43    0         203.78
PRISM.7147  0         47.21     89.33
...

This orientation corresponds to features_are_rows=False in the experiment configuration. The column named by sample_id_column is used as the sample identifier; its values must match the corresponding column in the metadata file.
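Before running an experiment, it can help to confirm that a features file really has this layout. The following is a minimal sketch using only the Python standard library; it is not part of mllabiome. The helper name load_features is hypothetical, and the sketch assumes that every cell after the first column is numeric.

```python
import csv
import io


def load_features(handle, sample_id_column="Sample"):
    """Read a samples-as-rows TSV and check its basic layout.

    Returns (feature_names, table), where table maps sample ID to a list of floats.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # the file must start with a header row
    if header[0] != sample_id_column:
        raise ValueError(f"first column is {header[0]!r}, expected {sample_id_column!r}")
    table = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        sample_id, values = row[0], row[1:]
        if sample_id in table:
            raise ValueError(f"duplicate sample identifier: {sample_id}")
        if len(values) != len(header) - 1:
            raise ValueError(f"row {sample_id} has {len(values)} values, expected {len(header) - 1}")
        table[sample_id] = [float(v) for v in values]  # assumes numeric cells
    return header[1:], table


# The example rows shown above, written out as tab-separated text.
demo = (
    "Sample\tmet_0000\tmet_0001\tmet_0002\n"
    "PRISM.7122\t158.43\t0\t203.78\n"
    "PRISM.7147\t0\t47.21\t89.33\n"
)
feature_names, table = load_features(io.StringIO(demo))
print(feature_names)
print(table["PRISM.7122"])
```

To check a real file, the same function can be given an open file handle, for example open("example/IBD/data/metabolomics_data.tsv", newline="").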

Metadata file

Rows are samples, columns are clinical or experimental variables. The file must contain a sample identifier column whose values match the sample identifier column in the features file, and at least one column to use as the prediction target.

For this dataset the relevant columns are Sample (identifier) and Study.Group (target, with values CD, UC, and nonIBD).
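The matching requirement between the two files can be checked with a short script. The sketch below is an illustration under stated assumptions, not mllabiome functionality: the helper names read_rows and compare_inputs are hypothetical, and the sample IDs and group assignments in the demo are invented placeholders rather than values from the real dataset.

```python
import csv
import io


def read_rows(handle):
    """Read a TSV with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def compare_inputs(feature_rows, metadata_rows,
                   sample_id_column="Sample", target_column="Study.Group"):
    """Report sample IDs present in one file but not the other,
    plus the target values seen among samples shared by both files."""
    feature_ids = {row[sample_id_column] for row in feature_rows}
    metadata_ids = {row[sample_id_column] for row in metadata_rows}
    labels = {
        row[target_column]
        for row in metadata_rows
        if row[sample_id_column] in feature_ids
    }
    return {
        "missing_from_metadata": sorted(feature_ids - metadata_ids),
        "missing_from_features": sorted(metadata_ids - feature_ids),
        "target_values": sorted(labels),
    }


# Invented placeholder data, only to exercise the check.
features_tsv = "Sample\tmet_0000\nsample_A\t1.0\nsample_B\t2.0\nsample_C\t3.0\n"
metadata_tsv = "Sample\tStudy.Group\nsample_A\tCD\nsample_B\tnonIBD\nsample_D\tUC\n"

report = compare_inputs(
    read_rows(io.StringIO(features_tsv)),
    read_rows(io.StringIO(metadata_tsv)),
)
print(report)
```

An empty list for both "missing" entries indicates that the identifier columns line up; any other result points to samples that would lack either features or a label.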
