Packaging models
This page covers retraining the IBD ensemble on the full dataset and saving it in a format the backend can load.
The deployment script
The complete workflow is implemented in example/IBD/deploy_ibd.py:
from mllabiome.production import ProductionPipeline
# --- Paths (same data used in the experiment) ---
EXPERIMENT_DIR = "results/ibd_franzosa"
ENSEMBLE_SUMMARY = "ensemble/ensemble_summary.json"
PROFILE_PATH = "example/IBD/data/FRANZOSA_IBD_2019_profiles_hierarchical.tsv"
METADATA_PATH = "example/IBD/data/metadata.tsv"
# --- Load ensemble configuration ---
pipeline = ProductionPipeline(
    experiment_dir=EXPERIMENT_DIR,
    ensemble_summary_path=ENSEMBLE_SUMMARY,
)
# --- Retrain all ensemble members on the full dataset ---
pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)
# --- Quick sanity check: predict on the training data itself ---
predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)
print("\nSample predictions (first 5):")
print(predictions.head())
print(f"\nTotal samples predicted: {len(predictions)}")
Run it from the repository root:
python example/IBD/deploy_ibd.py
The rest of this page walks through each part of the script in detail.
Step-by-step breakdown
Loading the ensemble configuration
ProductionPipeline reads the experiment directory and locates ensemble_summary.json. Because the IBD experiment stores it under ensemble/, the path is passed explicitly:
pipeline = ProductionPipeline(
    experiment_dir="results/ibd_franzosa",
    ensemble_summary_path="ensemble/ensemble_summary.json",
)
The pipeline prints a confirmation:
✓ Loaded ensemble configuration:
Target: Study.Group
Method: copeland
Models: 5
Inner CV Score: 0.6853
Retraining on all training data
During the experiment, each model was trained on a single CV fold. train_production_models retrains every ensemble member using all 220 samples. Each model keeps its original preprocessing (taxonomic filtering, compositional transformation, feature set):
pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)
Output files
After training completes, app/backend/production_models/ibd_franzosa/ contains:
ibd_franzosa/
    XGBoost_n_estimators-1000.pkl
    RandomForestClassifier_min_samples_leaf-5_n_estimators-1000_random_state-91.pkl
    ...                      # one .pkl per ensemble member
    ensemble_config.json     # aggregation strategy, model list, score
    pipeline_configs.json    # per-model preprocessing settings
Verifying locally
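As a quick check that training wrote everything listed above, the following sketch looks for the two JSON files and at least one .pkl file in the save directory. It uses only the standard library. check_production_dir is a hypothetical helper, not part of mllabiome, and it checks only that the files exist, not what they contain.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_FILES = ("ensemble_config.json", "pipeline_configs.json")


def check_production_dir(save_dir):
    """Return a list of problems found in a saved production directory.

    Hypothetical helper for illustration; an empty list means the
    expected files are present.
    """
    root = Path(save_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]
    # At least one pickled ensemble member should have been written.
    if not any(root.glob("*.pkl")):
        problems.append("no .pkl model files found")
    return problems


if __name__ == "__main__":
    issues = check_production_dir("app/backend/production_models/ibd_franzosa")
    print("OK" if not issues else "\n".join(issues))
```

An empty result only says the files exist; the prediction check that the script runs is still what confirms the models load and execute.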
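Before starting the backend, you can run predictions directly from the pipeline object. Because the models have already seen these samples, this confirms that the pipeline runs end to end rather than estimating accuracy. The script runs the full ensemble (preprocessing, per-model prediction, Copeland aggregation) on the training data: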
predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)
This returns a DataFrame with one row per sample.
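The Copeland aggregation step can be illustrated with a small standalone sketch. This page does not show how mllabiome applies the rule internally, so the code below is the generic Copeland method under assumed inputs: each ensemble member supplies a probability per class, every pair of classes is compared by majority across members, and a class gains a point for each pairwise win and loses one for each pairwise loss. The function name copeland_winner, the input format, the tie-breaking rule, and the class labels are all assumptions made for illustration.

```python
from itertools import combinations


def copeland_winner(model_probs):
    """Pick a class by Copeland's rule from per-model class probabilities.

    model_probs: list of dicts mapping class label -> probability,
    one dict per ensemble member (an assumed input format).
    Returns (winning_label, scores).
    """
    classes = sorted(model_probs[0])
    scores = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        # Count how many models rank class a above class b, and vice versa.
        a_wins = sum(p[a] > p[b] for p in model_probs)
        b_wins = sum(p[b] > p[a] for p in model_probs)
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
    # Highest score wins; ties go to the alphabetically first label.
    best = max(classes, key=lambda c: scores[c])
    return best, scores


if __name__ == "__main__":
    # Illustrative labels and probabilities, not real model output.
    example = [
        {"CD": 0.6, "UC": 0.3, "Control": 0.1},
        {"CD": 0.2, "UC": 0.5, "Control": 0.3},
        {"CD": 0.5, "UC": 0.4, "Control": 0.1},
    ]
    winner, scores = copeland_winner(example)
    print(winner, scores)
```

In this example two of the three models rank CD above each other class, so CD wins both of its pairwise contests even though one model prefers UC.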
Loading a saved pipeline
A previously saved production directory can be reloaded without the original experiment:
loaded = ProductionPipeline.load_production("app/backend/production_models/ibd_franzosa")
predictions = loaded.predict(
    test_data_path="new_samples.tsv",
    sample_id_column="Sample",
)
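Before passing a new file to predict, it can help to confirm that the sample ID column is present. The sketch below assumes the file is tab-separated and that the column named in sample_id_column appears in its header row. has_sample_column is a hypothetical helper, not part of mllabiome.

```python
import csv


def has_sample_column(tsv_path, sample_id_column="Sample"):
    """Return True if the TSV header row contains the sample ID column.

    Hypothetical pre-flight check; assumes a tab-separated file whose
    first row is the header.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header instead of raising.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return sample_id_column in header


if __name__ == "__main__":
    import os

    path = "new_samples.tsv"
    if os.path.exists(path):
        print(has_sample_column(path))
```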