
Packaging models

This page covers retraining the IBD ensemble on the full dataset and saving it in a format the backend can load.

The deployment script

The complete workflow is implemented in example/IBD/deploy_ibd.py:

example/IBD/deploy_ibd.py
from mllabiome.production import ProductionPipeline

# --- Paths (same data used in the experiment) ---
EXPERIMENT_DIR = "results/ibd_franzosa"
ENSEMBLE_SUMMARY = "ensemble/ensemble_summary.json"
PROFILE_PATH = "example/IBD/data/FRANZOSA_IBD_2019_profiles_hierarchical.tsv"
METADATA_PATH = "example/IBD/data/metadata.tsv"

# --- Load ensemble configuration ---
pipeline = ProductionPipeline(
    experiment_dir=EXPERIMENT_DIR,
    ensemble_summary_path=ENSEMBLE_SUMMARY,
)

# --- Retrain all ensemble members on the full dataset ---
pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)

# --- Quick sanity check: predict on the training data itself ---
predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)

print("\nSample predictions (first 5):")
print(predictions.head())
print(f"\nTotal samples predicted: {len(predictions)}")

Run it from the repository root:

python example/IBD/deploy_ibd.py

The rest of this page walks through each part of the script in detail.


Step-by-step breakdown

Loading the ensemble configuration

ProductionPipeline reads the experiment directory and locates ensemble_summary.json. Because the IBD experiment stores it under ensemble/, the path is passed explicitly:

pipeline = ProductionPipeline(
    experiment_dir="results/ibd_franzosa",
    ensemble_summary_path="ensemble/ensemble_summary.json",
)

The pipeline prints a confirmation:

✓ Loaded ensemble configuration:
  Target: Study.Group
  Method: copeland
  Models: 5
  Inner CV Score: 0.6853

Retraining on all training data

During the experiment, each model was trained on a single CV fold. train_production_models retrains every ensemble member using all 220 samples. Each model keeps its original preprocessing (taxonomic filtering, compositional transformation, feature set):

pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)

Output files

After training completes, app/backend/production_models/ibd_franzosa/ contains:

ibd_franzosa/
  XGBoost_n_estimators-1000.pkl
  RandomForestClassifier_min_samples_leaf-5_n_estimators-1000_random_state-91.pkl
  ...                       # one .pkl per ensemble member
  ensemble_config.json      # aggregation strategy, model list, score
  pipeline_configs.json     # per-model preprocessing settings
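Before pointing the backend at this directory, it can be useful to confirm that the expected files are actually there. The following is a minimal sketch using only the standard library; `summarize_production_dir` is a hypothetical helper, not part of mllabiome. It relies only on the file names listed above and assumes nothing about the JSON schema beyond the files being valid JSON.

```python
import json
from pathlib import Path


def summarize_production_dir(save_dir):
    """Return a small summary of a saved production directory.

    Hypothetical helper: only the file names come from the listing above.
    The JSON files are parsed to confirm they are well-formed, but no
    specific fields are assumed.
    """
    root = Path(save_dir)
    summary = {
        # One .pkl file is expected per ensemble member.
        "models": sorted(p.name for p in root.glob("*.pkl")),
        # Names of expected config files that were not found.
        "missing": [],
    }
    for name in ("ensemble_config.json", "pipeline_configs.json"):
        path = root / name
        if not path.is_file():
            summary["missing"].append(name)
            continue
        with path.open(encoding="utf-8") as handle:
            json.load(handle)  # raises json.JSONDecodeError if the file is corrupt
    return summary
```

Calling `summarize_production_dir("app/backend/production_models/ibd_franzosa")` after the deployment script has run should list one `.pkl` per ensemble member and an empty `missing` list.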

Verifying locally

Before starting the backend, the pipeline can predict directly. The script runs the full ensemble (preprocessing, per-model prediction, Copeland aggregation) on the training data as a sanity check:

predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)

This returns a DataFrame with one row per sample.
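To make the aggregation step more concrete, here is a sketch of Copeland-style voting for a single sample. It illustrates the general Copeland method (each class earns a point for every pairwise contest it wins across models and loses a point for every contest it loses) and is not necessarily how mllabiome implements it. The function names, the class labels, and the alphabetical tie-break are all assumptions made for the example.

```python
from itertools import combinations


def copeland_scores(model_probs):
    """Copeland score per class for ONE sample.

    model_probs: list of dicts, one per ensemble member, mapping
    class label -> predicted probability (illustrative input format).
    """
    classes = sorted(model_probs[0])
    scores = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        # Count how many models rank class a above class b, and vice versa.
        prefer_a = sum(1 for p in model_probs if p[a] > p[b])
        prefer_b = sum(1 for p in model_probs if p[b] > p[a])
        if prefer_a > prefer_b:
            scores[a] += 1
            scores[b] -= 1
        elif prefer_b > prefer_a:
            scores[b] += 1
            scores[a] -= 1
        # An exact tie in the pairwise contest changes neither score.
    return scores


def copeland_predict(model_probs):
    """Return the class with the highest Copeland score."""
    scores = copeland_scores(model_probs)
    # Ties are broken alphabetically; an arbitrary choice for this sketch.
    return max(sorted(scores), key=scores.get)


# Three hypothetical models voting on one sample (labels are illustrative).
example = [
    {"CD": 0.6, "Control": 0.3, "UC": 0.1},
    {"CD": 0.2, "Control": 0.5, "UC": 0.3},
    {"CD": 0.5, "Control": 0.1, "UC": 0.4},
]
print(copeland_scores(example))   # {'CD': 2, 'Control': 0, 'UC': -2}
print(copeland_predict(example))  # CD
```

Because the method compares rankings rather than averaging probabilities, a single overconfident model has less influence on the outcome than it would under plain probability averaging.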

Loading a saved pipeline

A previously saved production directory can be reloaded without the original experiment:

from mllabiome.production import ProductionPipeline

loaded = ProductionPipeline.load_production("app/backend/production_models/ibd_franzosa")

predictions = loaded.predict(
    test_data_path="new_samples.tsv",
    sample_id_column="Sample",
)
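When predicting on new files, a quick pre-flight check of the input can catch problems before the models are loaded. The sketch below is a hypothetical helper (not part of mllabiome) that assumes the TSV has one row per sample and an ID column named as in `sample_id_column`. If the profile file is laid out differently, the check would need to be adapted.

```python
import csv


def check_sample_ids(tsv_path, sample_id_column="Sample"):
    """Return the sample IDs in a TSV, raising ValueError on obvious problems.

    Assumes one row per sample and a header row containing the ID column.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if sample_id_column not in (reader.fieldnames or []):
            raise ValueError(f"column {sample_id_column!r} not found in {tsv_path}")
        ids = [row[sample_id_column] for row in reader]
    # Duplicate IDs would make per-sample predictions ambiguous.
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample IDs: {duplicates}")
    return ids
```

For example, `check_sample_ids("new_samples.tsv")` returns the list of IDs when the file is well-formed, and its length can be compared against `len(predictions)` afterwards.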