Packaging models

This page covers retraining the IBD ensemble on the full dataset and saving it in a format the backend can load.

The deployment script

The complete workflow is implemented in example/IBD/deploy_ibd.py:



from mllabiome.production import ProductionPipeline

# --- Paths (same data used in the experiment) ---
EXPERIMENT_DIR = "results/ibd_franzosa"
ENSEMBLE_SUMMARY = "ensemble/ensemble_summary.json"
PROFILE_PATH = "example/IBD/data/FRANZOSA_IBD_2019_profiles_hierarchical.tsv"
METADATA_PATH = "example/IBD/data/metadata.tsv"

# --- Load ensemble configuration ---
pipeline = ProductionPipeline(
    experiment_dir=EXPERIMENT_DIR,
    ensemble_summary_path=ENSEMBLE_SUMMARY,
)

# --- Retrain all ensemble members on the full dataset ---
pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)

# --- Quick sanity check: predict on the training data itself ---
predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)

print("\nSample predictions (first 5):")
print(predictions.head())
print(f"\nTotal samples predicted: {len(predictions)}")

Run it from the repository root:


python example/IBD/deploy_ibd.py

The rest of this page walks through each part of the script in detail.

Step-by-step breakdown

Loading the ensemble configuration

ProductionPipeline reads the experiment directory and locates ensemble_summary.json. Because the IBD experiment stores it under ensemble/, the path is passed explicitly:


pipeline = ProductionPipeline(
    experiment_dir="results/ibd_franzosa",
    ensemble_summary_path="ensemble/ensemble_summary.json",
)

The pipeline prints a confirmation:


✓ Loaded ensemble configuration:
  Target: Study.Group
  Method: copeland
  Models: 5
  Inner CV Score: 0.6853
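
If the summary file cannot be found, construction presumably fails, so it can be worth checking the path beforehand. The sketch below assumes that ensemble_summary_path is interpreted relative to experiment_dir, which this page implies but does not state. resolve_summary_path is a hypothetical helper written for illustration, not part of mllabiome.

```python
from pathlib import Path


def resolve_summary_path(experiment_dir: str, summary_path: str) -> Path:
    """Hypothetical helper: join the two paths and fail early if the file is missing.

    Assumption: ensemble_summary_path is resolved relative to experiment_dir.
    """
    candidate = Path(experiment_dir) / summary_path
    if not candidate.is_file():
        raise FileNotFoundError(f"No ensemble summary at {candidate}")
    return candidate
```

For the IBD example, the call would be resolve_summary_path("results/ibd_franzosa", "ensemble/ensemble_summary.json").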

Retraining on all training data

During the experiment, each model was trained on only part of the data (a single CV fold). train_production_models retrains every ensemble member on all 220 samples, and each model keeps its original preprocessing (taxonomic filtering, compositional transformation, feature set):


pipeline.train_production_models(
    train_data_path=PROFILE_PATH,
    train_targets_path=METADATA_PATH,
    sample_id_column="Sample",
    target_column="Study.Group",
    save_dir="app/backend/production_models/ibd_franzosa",
)
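
Since the production models are fit on every sample, it can help to look at the label distribution before retraining. The following is a minimal sketch using only the standard library. It assumes the metadata file is tab-separated with a header row that contains the target column. count_labels is a hypothetical helper, not an mllabiome function.

```python
import csv
from collections import Counter


def count_labels(metadata_path: str, target_column: str) -> Counter:
    """Count how many samples carry each label in a tab-separated metadata file."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or target_column not in reader.fieldnames:
            raise KeyError(f"Column {target_column!r} not found in {metadata_path}")
        return Counter(row[target_column] for row in reader)
```

Calling count_labels(METADATA_PATH, "Study.Group") gives the per-class counts. If the metadata lists only the training samples, the counts should sum to the 220 samples mentioned above.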

Output files

After training completes, app/backend/production_models/ibd_franzosa/ contains:


ibd_franzosa/
  XGBoost_n_estimators-1000.pkl
  RandomForestClassifier_min_samples_leaf-5_n_estimators-1000_random_state-91.pkl
  ...                                     # one .pkl per ensemble member
  ensemble_config.json                     # aggregation strategy, model list, score
  pipeline_configs.json                    # per-model preprocessing settings
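
A quick way to confirm that the directory is complete is to parse both JSON files and count the pickled models. This sketch relies only on the file names listed above and makes no assumption about the keys inside the JSON files. summarize_production_dir is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_production_dir(save_dir: str) -> dict:
    """Hypothetical check: confirm both JSON files parse and count the .pkl models."""
    root = Path(save_dir)
    for name in ("ensemble_config.json", "pipeline_configs.json"):
        with open(root / name, encoding="utf-8") as handle:
            json.load(handle)  # raises if the file is missing or malformed
    model_files = sorted(p.name for p in root.glob("*.pkl"))
    return {"n_models": len(model_files), "model_files": model_files}
```

For the IBD ensemble, n_models would be expected to equal 5, matching the "Models: 5" line in the confirmation output.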

Verifying locally

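Before starting the backend, you can run predictions directly from the pipeline. The script runs the full ensemble (preprocessing, per-model prediction, Copeland aggregation) on the training data as a sanity check. Because the models have already seen these samples, this confirms only that the pipeline runs end to end, not how well it generalizes: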


predictions = pipeline.predict(
    test_data_path=PROFILE_PATH,
    sample_id_column="Sample",
    return_probabilities=True,
)

This returns a DataFrame with one row per sample.
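
To give a feel for the Copeland step, here is a generic sketch of Copeland scoring, in which each candidate earns a point for every pairwise contest it wins and loses a point for every one it loses. This is the textbook method applied to per-model class scores, not mllabiome's implementation, which this page does not show and which may operate on different inputs. The function name, class labels, and numbers are made up for illustration.

```python
from itertools import combinations


def copeland_scores(ballots: list[dict[str, float]]) -> dict[str, int]:
    """Generic Copeland scoring: pairwise wins minus pairwise losses.

    Each ballot maps a candidate (here: a class label) to one model's score
    for it. Candidate a beats b when more ballots score a above b than the
    reverse; a tied contest changes neither score.
    """
    candidates = sorted(ballots[0])
    scores = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        prefer_a = sum(1 for ballot in ballots if ballot[a] > ballot[b])
        prefer_b = sum(1 for ballot in ballots if ballot[b] > ballot[a])
        if prefer_a > prefer_b:
            scores[a] += 1
            scores[b] -= 1
        elif prefer_b > prefer_a:
            scores[b] += 1
            scores[a] -= 1
    return scores


# Three hypothetical models scoring one sample; labels and numbers are invented.
ballots = [
    {"CD": 0.6, "UC": 0.3, "Control": 0.1},
    {"CD": 0.2, "UC": 0.5, "Control": 0.3},
    {"CD": 0.5, "UC": 0.4, "Control": 0.1},
]
scores = copeland_scores(ballots)  # {"CD": 2, "Control": -2, "UC": 0}
winner = max(scores, key=scores.get)  # "CD"
```

Here "CD" wins both of its pairwise contests by two ballots to one, so it takes the top Copeland score even though the second model ranks it last.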

Loading a saved pipeline

A previously saved production directory can be reloaded without the original experiment:


loaded = ProductionPipeline.load_production("app/backend/production_models/ibd_franzosa")

predictions = loaded.predict(
    test_data_path="new_samples.tsv",
    sample_id_column="Sample",
)