Splitting data across clients

Before running a federated session, each client site needs its own subset of the data. In a real deployment, each site already has its own samples. For this tutorial, the IBD dataset is split programmatically.

Running the split script

python example/IBD/split_ibd_federated.py

This reads FRANZOSA_IBD_2019_profiles_hierarchical.tsv and metadata.tsv, assigns each sample to one of three clients using stratified random assignment (preserving the CD/UC/Control ratio), and writes per-client files.

Output structure

example/IBD/data/federated/
  client_1/
    profile.tsv     # taxonomic profiles for ~73 samples
    metadata.tsv    # metadata for the same samples
  client_2/
    profile.tsv
    metadata.tsv
  client_3/
    profile.tsv
    metadata.tsv

What the script does

  1. Loads the metadata file and extracts the Study.Group column (the target variable).
  2. Performs a stratified 3-way split using scikit-learn. Each client receives approximately one third of the samples, with the CD/UC/Control proportions preserved.
  3. For each client, subsets the profile matrix to that client’s samples and writes both the profile and metadata files.
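
The stratified assignment in step 2 can be illustrated without any dependencies. The sketch below is not the script's code and does not use scikit-learn: it swaps in a plain round-robin deal within each class, which gives the same kind of result (per-class counts that differ by at most one between clients). The function name stratified_assign, the seed parameter, and the synthetic sample IDs are hypothetical; the class totals (88 CD, 76 UC, 56 Control) are taken from the summary printed by the script.

```python
import random
from collections import Counter, defaultdict


def stratified_assign(labels, n_clients=3, seed=0):
    """Map each sample ID to a client index while keeping class ratios similar.

    `labels` maps sample ID -> class label (e.g. the Study.Group value).
    This is an illustrative stand-in, not the scikit-learn call used by
    split_ibd_federated.py.
    """
    rng = random.Random(seed)

    # Group sample IDs by class label.
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)

    assignment = {}
    offset = 0  # keep dealing where the previous class stopped, so totals stay balanced
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)  # random order within the class
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (offset + i) % n_clients
        offset += len(ids)
    return assignment


if __name__ == "__main__":
    # Synthetic IDs; only the class totals come from the tutorial's summary.
    classes = ["CD"] * 88 + ["UC"] * 76 + ["Control"] * 56
    labels = {f"sample_{i}": label for i, label in enumerate(classes)}

    assignment = stratified_assign(labels)
    for client in range(3):
        counts = Counter(labels[s] for s, c in assignment.items() if c == client)
        print(f"client_{client + 1}: {sum(counts.values())} samples {dict(counts)}")
```

With these class totals the sketch produces client sizes of 74, 73, and 73, and every class is spread so that no two clients differ by more than one sample. Which samples land on which client depends on the seed and will not match the script's actual split.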

The script prints a summary:

client_1: 73 samples (CD=29, UC=25, Control=19)
client_2: 73 samples (CD=29, UC=25, Control=19)
client_3: 74 samples (CD=30, UC=26, Control=18)

Verifying the split

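Each client’s profile file should contain only that client’s samples and all 71,013 features (taxa). The metadata file should contain only the corresponding rows. The commands below count the columns in the profile header (one per sample, plus any leading feature-ID column) and the lines in the metadata file (one per sample, plus any header line):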

head -1 example/IBD/data/federated/client_1/profile.tsv | tr '\t' '\n' | wc -l
wc -l example/IBD/data/federated/client_1/metadata.tsv
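
The same check can be done in Python when you also want to confirm that the two files list the same samples, not just the same number of them. This is a minimal sketch under assumptions the tutorial does not state: that the profile file has one header row whose first column labels the feature and whose remaining columns are sample IDs, and that the metadata file has one header row with the sample ID in its first column. The function name check_client is hypothetical.

```python
import csv


def check_client(profile_path, metadata_path):
    """Return (n_samples, n_features) if both files describe the same samples.

    Assumed layout (not confirmed by the tutorial):
      - profile.tsv: header row = feature label, then one column per sample
      - metadata.tsv: header row, then one row per sample, ID in column 0
    """
    with open(profile_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        profile_ids = header[1:]  # skip the assumed feature-label column
        n_features = sum(1 for row in reader if row)  # non-empty data rows

    with open(metadata_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the assumed header row
        metadata_ids = [row[0] for row in reader if row]

    if set(profile_ids) != set(metadata_ids):
        raise ValueError("profile and metadata list different samples")
    return len(profile_ids), n_features
```

If those layout assumptions hold, calling check_client on the client_1 files should return a sample count close to 73 and a feature count of 71,013. A ValueError means a sample appears in one file but not the other.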