Splitting data across clients

Before running a federated session, each client site needs its own subset of the data. In a real deployment, each site already has its own samples. For this tutorial, the IBD dataset is split programmatically.

Running the split script


python example/IBD/split_ibd_federated.py

The script reads FRANZOSA_IBD_2019_profiles_hierarchical.tsv and metadata.tsv, assigns each sample to one of three clients by stratified random assignment so that every client keeps roughly the same CD/UC/Control ratio, and writes one profile file and one metadata file per client.

Output structure


example/IBD/data/federated/
  client_1/
    profile.tsv            # taxonomic profiles for ~73 samples
    metadata.tsv           # metadata for the same samples
  client_2/
    profile.tsv
    metadata.tsv
  client_3/
    profile.tsv
    metadata.tsv
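
As a rough illustration of how such a layout could be produced, the sketch below writes profile.tsv and metadata.tsv for one client from in-memory tables. It is not the tutorial script: the helper names write_tsv and write_client are hypothetical, the data is a toy example, and it assumes that profile columns are samples (with the feature ID in the first column) and that metadata rows are samples (with the sample ID in the first column).

```python
import csv
import tempfile
from pathlib import Path

def write_tsv(path, rows):
    """Write a list of rows as a tab-separated file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)

def write_client(out_root, client, sample_ids, profile_rows, metadata_rows):
    """Write profile.tsv and metadata.tsv for one client under out_root/client/."""
    keep = set(sample_ids)
    client_dir = Path(out_root) / client
    client_dir.mkdir(parents=True, exist_ok=True)

    # Assumed layout: profile columns are samples, first column is the feature ID.
    header = profile_rows[0]
    cols = [0] + [i for i, name in enumerate(header) if i > 0 and name in keep]
    write_tsv(client_dir / "profile.tsv", [[row[i] for i in cols] for row in profile_rows])

    # Assumed layout: metadata rows are samples, first column is the sample ID.
    meta = [metadata_rows[0]] + [row for row in metadata_rows[1:] if row[0] in keep]
    write_tsv(client_dir / "metadata.tsv", meta)
    return client_dir

# Toy data, not taken from the real dataset.
profile = [
    ["feature", "S1", "S2", "S3"],
    ["taxon_a", "0.1", "0.2", "0.3"],
    ["taxon_b", "0.9", "0.8", "0.7"],
]
metadata = [["sample", "Study.Group"], ["S1", "CD"], ["S2", "UC"], ["S3", "Control"]]

out_root = Path(tempfile.mkdtemp()) / "federated"
client_dir = write_client(out_root, "client_1", ["S1", "S3"], profile, metadata)
print(sorted(p.name for p in client_dir.iterdir()))
```

Every feature row is kept for each client and only the sample dimension is filtered, which matches the expectation that each client sees the full feature set.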

What the script does

1. Loads the metadata file and extracts the Study.Group column (the target variable).
2. Performs a stratified 3-way split using scikit-learn. Each client receives approximately one third of the samples, with the CD/UC/Control proportions preserved.
3. For each client, subsets the profile matrix to that client’s samples and writes both the profile and metadata files.
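
The idea behind the stratified split can be sketched in a few lines. The script itself uses scikit-learn; the version below swaps in a plain-Python shuffle followed by a round-robin deal within each class, so that it runs without dependencies. The function name stratified_assign and the sample IDs are hypothetical, and the labels are synthetic, using only the class totals implied by the printed summary (88 CD, 76 UC, 56 Control).

```python
import random
from collections import Counter, defaultdict

def stratified_assign(labels, n_clients=3, seed=0):
    """Map each sample ID to a client index, balancing every class across clients."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)

    assignment = {}
    position = 0  # running counter, so leftover samples rotate across clients
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)  # random order within the class
        for sample_id in ids:
            assignment[sample_id] = position % n_clients
            position += 1
    return assignment

# Synthetic labels with hypothetical sample IDs.
groups = ["CD"] * 88 + ["UC"] * 76 + ["Control"] * 56
labels = {f"S{i:03d}": group for i, group in enumerate(groups)}
assignment = stratified_assign(labels)

sizes = Counter(assignment.values())
per_client = Counter((client, labels[s]) for s, client in assignment.items())
for client in range(3):
    counts = {g: per_client[(client, g)] for g in ("CD", "UC", "Control")}
    print(f"client_{client + 1}: {sizes[client]} samples {counts}")
```

Dealing each class separately keeps the per-class counts within one sample of each other across clients. Carrying the counter across classes keeps the total client sizes balanced as well.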

The script prints a summary:


client_1: 73 samples (CD=29, UC=25, Control=19)
client_2: 73 samples (CD=29, UC=25, Control=19)
client_3: 74 samples (CD=30, UC=26, Control=18)
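
To see how closely the proportions are preserved, the counts from the summary can be converted to percentages and compared with the pooled dataset. The snippet below is only a worked check on the numbers printed above, not part of the script.

```python
# Counts copied from the printed summary.
summary = {
    "client_1": {"CD": 29, "UC": 25, "Control": 19},
    "client_2": {"CD": 29, "UC": 25, "Control": 19},
    "client_3": {"CD": 30, "UC": 26, "Control": 18},
}

groups = ("CD", "UC", "Control")
pooled = {g: sum(counts[g] for counts in summary.values()) for g in groups}
total = sum(pooled.values())
pooled_share = {g: 100 * pooled[g] / total for g in groups}

max_gap = 0.0
for client, counts in summary.items():
    n = sum(counts.values())
    shares = {g: 100 * counts[g] / n for g in groups}
    max_gap = max(max_gap, max(abs(shares[g] - pooled_share[g]) for g in groups))
    print(client, n, {g: round(shares[g], 1) for g in groups})

print("pooled", total, {g: round(pooled_share[g], 1) for g in groups})
print("largest gap (percentage points):", round(max_gap, 2))
```

For these counts, the largest gap between any client's class share and the pooled share is a little over one percentage point.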

Verifying the split

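Each client’s profile file should contain only that client’s samples and all 71,013 features (taxa). The metadata file should contain only the corresponding rows. The commands below count the columns in the profile header and the lines in the metadata file (the line count includes the header line, if the file has one):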


head -1 example/IBD/data/federated/client_1/profile.tsv | tr '\t' '\n' | wc -l
wc -l example/IBD/data/federated/client_1/metadata.tsv
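
A further check that may be useful is that no sample ended up in more than one client. The sketch below is one way to test this from the metadata files alone. The helper names are hypothetical, and it assumes each metadata.tsv has a header row with the sample ID in the first column. The demo builds a toy directory in place of example/IBD/data/federated/.

```python
import csv
import tempfile
from pathlib import Path

def read_sample_ids(metadata_path):
    """Return sample IDs from the first column, skipping the header row (assumed layout)."""
    with open(metadata_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return [row[0] for row in rows[1:] if row]

def check_disjoint(root, clients):
    """Return the number of distinct samples; raise ValueError on any overlap."""
    seen = {}
    for client in clients:
        for sample_id in read_sample_ids(Path(root) / client / "metadata.tsv"):
            if sample_id in seen:
                raise ValueError(f"{sample_id} appears in {seen[sample_id]} and {client}")
            seen[sample_id] = client
    return len(seen)

# Toy directory with made-up sample IDs.
root = Path(tempfile.mkdtemp())
toy = {"client_1": ["S1", "S2"], "client_2": ["S3"], "client_3": ["S4", "S5"]}
for client, ids in toy.items():
    (root / client).mkdir()
    lines = ["sample\tStudy.Group"] + [f"{s}\tCD" for s in ids]
    (root / client / "metadata.tsv").write_text("\n".join(lines) + "\n")

total_samples = check_disjoint(root, sorted(toy))
print(total_samples)
```

On the real output, the returned count should equal the number of samples in the original metadata file if every sample was assigned to exactly one client.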