Splitting data across clients
Before running a federated session, each client site needs its own subset of the data. In a real deployment, each site already has its own samples. For this tutorial, the IBD dataset is split programmatically.
Running the split script
python example/IBD/split_ibd_federated.py

The script reads FRANZOSA_IBD_2019_profiles_hierarchical.tsv and metadata.tsv, assigns each sample to one of three clients using stratified random assignment (preserving the CD/UC/Control ratio), and writes per-client files.
Output structure
example/IBD/data/federated/
  client_1/
    profile.tsv   # taxonomic profiles for ~73 samples
    metadata.tsv  # metadata for the same samples
  client_2/
    profile.tsv
    metadata.tsv
  client_3/
    profile.tsv
    metadata.tsv

What the script does
- Loads the metadata file and extracts the Study.Group column (the target variable).
- Performs a stratified 3-way split using scikit-learn. Each client receives approximately one third of the samples, with the CD/UC/Control proportions preserved.
- For each client, subsets the profile matrix to that client’s samples and writes both the profile and metadata files.
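The idea behind the stratified assignment in the steps above can be sketched without any dependencies. This is not the code from split_ibd_federated.py (which uses scikit-learn); it is a minimal illustration that swaps in a shuffled round-robin per class. The function name `stratified_assign`, the sample IDs, and the seed are hypothetical; the class totals are those obtained by adding up the summary the script prints (88 CD, 76 UC, 56 Control).

```python
import random
from collections import Counter, defaultdict


def stratified_assign(labels, n_clients=3, seed=0):
    """Assign each sample ID to a client while keeping class ratios balanced.

    labels: dict mapping sample ID -> class label (e.g. "CD", "UC", "Control").
    Returns a dict mapping sample ID -> client name ("client_1", ...).
    """
    rng = random.Random(seed)

    # Group sample IDs by class so each class is dealt out separately.
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)

    assignment = {}
    next_client = 0  # carried across classes so total client sizes stay balanced
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)  # random order within the class
        for sample_id in ids:
            assignment[sample_id] = f"client_{next_client + 1}"
            next_client = (next_client + 1) % n_clients
    return assignment


# Hypothetical sample IDs; only the class totals come from the tutorial.
toy_labels = {}
for group, n in [("CD", 88), ("UC", 76), ("Control", 56)]:
    for i in range(n):
        toy_labels[f"{group}_{i:03d}"] = group

assignment = stratified_assign(toy_labels, n_clients=3, seed=42)

# Count samples per (client, class) pair and print a per-client breakdown.
per_client = Counter((client, toy_labels[s]) for s, client in assignment.items())
for client in ("client_1", "client_2", "client_3"):
    counts = {g: per_client[(client, g)] for g in ("CD", "UC", "Control")}
    print(client, sum(counts.values()), counts)
```

Because each class is dealt out one sample at a time, no class count differs by more than one between clients. The exact per-client numbers depend on the method and seed, so this sketch is not expected to reproduce the real script's assignment.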
The script prints a summary:
client_1: 73 samples (CD=29, UC=25, Control=19)
client_2: 73 samples (CD=29, UC=25, Control=19)
client_3: 74 samples (CD=30, UC=26, Control=18)

Verifying the split
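Each client’s profile file should contain only that client’s samples and all 71,013 features (taxa). The metadata file should contain only the corresponding rows. The first command below counts the tab-separated fields in the profile header line; the second counts the lines in the metadata file, including any header line: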
head -1 example/IBD/data/federated/client_1/profile.tsv | tr '\t' '\n' | wc -l
wc -l example/IBD/data/federated/client_1/metadata.tsv
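A stricter check compares the sample IDs themselves rather than just the counts. The sketch below is one way to do that in Python. It rests on assumptions the tutorial does not state: that samples are the columns of profile.tsv with feature IDs in the first column, that metadata.tsv has one header row, and that its first column holds the sample ID. The function name `check_client_files` is hypothetical.

```python
import csv


def check_client_files(profile_path, metadata_path):
    """Return (n_samples, n_features) if both files list the same sample IDs.

    Assumed layout: profile.tsv has feature IDs in the first column and one
    column per sample; metadata.tsv has a header row and sample IDs in the
    first column. Adjust the indexing if the real files differ.
    """
    with open(profile_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        profile_samples = header[1:]         # skip the feature-ID column
        n_features = sum(1 for _ in reader)  # remaining rows are features

    with open(metadata_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)                         # skip the header row
        metadata_samples = [row[0] for row in reader if row]

    if set(profile_samples) != set(metadata_samples):
        raise ValueError("profile and metadata list different samples")
    return len(profile_samples), n_features


# Example call (paths from the output structure described in this tutorial):
# check_client_files("example/IBD/data/federated/client_1/profile.tsv",
#                    "example/IBD/data/federated/client_1/metadata.tsv")
```

If those layout assumptions hold, the returned pair for client_1 should match the sample count in the printed summary and the feature count stated above.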