NaN handling

Multimodal data frequently contains missing values: a sample may be present in one modality but absent or incomplete in another. mllabiome handles NaN values separately for each modality, configured through ModalityConfig.nan_handling.

Available strategies

All strategies are members of mll.NaNHandlingStrategy:

Strategy	Behaviour
`DROP_SAMPLES`	Remove any sample that contains at least one NaN in this modality. This is the default.
`FILL_ZERO`	Replace NaN with `0`.
`FILL_MEAN`	Replace NaN with the column mean (numeric) or mode (categorical).
`FILL_MEDIAN`	Replace NaN with the column median (numeric) or mode (categorical).
`FILL_MODE`	Replace NaN with the most frequent value.
`FILL_CONSTANT`	Replace NaN with the value specified by `nan_fill_value`.
`KNN_IMPUTE`	K-nearest-neighbours imputation. The number of neighbours is controlled by `nan_knn_neighbors` (default 5).
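`FORWARD_FILL`	Replace NaN with the last preceding non-NaN value (useful for ordered data such as time series).
`BACKWARD_FILL`	Replace NaN with the next following non-NaN value (useful for ordered data such as time series).
`INTERPOLATE`	Replace NaN by linear interpolation between the neighbouring non-NaN values.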
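
The three order-dependent strategies are the easiest to confuse, so the following standard-library sketch shows what each technique does to one ordered column. It illustrates the general techniques only and is not mllabiome's implementation; the function names are hypothetical, and this section does not state how the library treats NaN values at the start or end of a series.

```python
import math

NAN = float("nan")


def is_nan(x):
    return isinstance(x, float) and math.isnan(x)


def forward_fill(column):
    """Carry the last observed value forward; leading NaN values stay NaN."""
    out, last = [], NAN
    for x in column:
        if not is_nan(x):
            last = x
        out.append(last)
    return out


def backward_fill(column):
    """Carry the next observed value backward; trailing NaN values stay NaN."""
    return forward_fill(column[::-1])[::-1]


def interpolate_linear(column):
    """Fill NaN gaps that are bounded on both sides, using the index as x."""
    out = list(column)
    known = [i for i, x in enumerate(column) if not is_nan(x)]
    for left, right in zip(known, known[1:]):
        step = (column[right] - column[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = column[left] + step * (i - left)
    return out


column = [1.0, NAN, 7.0, NAN, NAN, 10.0]

forward_filled = forward_fill(column)      # [1.0, 1.0, 7.0, 7.0, 7.0, 10.0]
backward_filled = backward_fill(column)    # [1.0, 7.0, 7.0, 10.0, 10.0, 10.0]
interpolated = interpolate_linear(column)  # [1.0, 4.0, 7.0, 8.0, 9.0, 10.0]
```

In this sketch, gaps that are not bounded on the relevant side are left as NaN, so a follow-up check for remaining NaN values is worthwhile when using these strategies.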

Configuration

NaN handling is set on each ModalityConfig individually, allowing different strategies per modality:


import mllabiome as mll
 
modality_a = mll.ModalityConfig(
    name="modality_a",
    file="modality_a_data.tsv",
    sample_id_column="Sample",
    nan_handling=mll.NaNHandlingStrategy.FILL_MEDIAN,
    verbose_nan_handling=True,
)
 
modality_b = mll.ModalityConfig(
    name="modality_b",
    file="modality_b_data.tsv",
    sample_id_column="Sample",
    nan_handling=mll.NaNHandlingStrategy.DROP_SAMPLES,
    verbose_nan_handling=True,
)
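
Because each modality is handled independently, the two configurations above can leave the modalities with different sample sets: median filling keeps every sample, while dropping removes the incomplete ones. The standard-library sketch below imitates that effect on two toy tables. The data and helper names are hypothetical, the code does not call mllabiome, and how the library aligns samples across modalities afterwards is not covered in this section.

```python
import math
import statistics


def is_nan(x):
    return isinstance(x, float) and math.isnan(x)


def drop_samples(table):
    """Keep only the samples whose row contains no NaN."""
    return {sid: row for sid, row in table.items() if not any(map(is_nan, row))}


def fill_median(table):
    """Replace NaN in each column with the median of that column's observed values."""
    columns = list(zip(*table.values()))
    medians = [statistics.median(v for v in col if not is_nan(v)) for col in columns]
    return {
        sid: [medians[j] if is_nan(v) else v for j, v in enumerate(row)]
        for sid, row in table.items()
    }


nan = float("nan")
# Toy stand-ins for the two modality files: sample ID -> feature values.
table_a = {"S1": [1.0, 10.0], "S2": [nan, 20.0], "S3": [3.0, 30.0]}
table_b = {"S1": [0.5], "S2": [nan], "S3": [0.7]}

a_clean = fill_median(table_a)   # S2 becomes [2.0, 20.0]; all three samples kept
b_clean = drop_samples(table_b)  # S2 is removed; S1 and S3 remain

print(sorted(a_clean))  # ['S1', 'S2', 'S3']
print(sorted(b_clean))  # ['S1', 'S3']
```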

Verbose output

When verbose_nan_handling=True, a summary of the NaN values found and imputed in that modality is printed during evaluation:


NaN handling [modality_a]: fill_median
  Found 26 NaN values in 6 samples
  Imputed 26 values -> 0 NaN remaining
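
The two numbers on the "Found" line measure different things: the first counts missing cells, the second counts samples with at least one missing cell. A minimal sketch of how such counts could be derived from a samples-by-features table follows; the function name `nan_summary` and the data are hypothetical, and this is not mllabiome's own reporting code.

```python
import math


def nan_summary(rows):
    """Return (number of NaN cells, number of samples with at least one NaN)."""
    nan_cells = 0
    affected_samples = 0
    for values in rows.values():
        missing = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
        nan_cells += missing
        if missing:
            affected_samples += 1
    return nan_cells, affected_samples


nan = float("nan")
rows = {
    "S1": [0.4, nan, 1.2],
    "S2": [0.9, 0.1, 0.7],
    "S3": [nan, nan, 2.5],
}

nan_cells, affected_samples = nan_summary(rows)
print(f"Found {nan_cells} NaN values in {affected_samples} samples")
# Found 3 NaN values in 2 samples
```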

Strategy-specific parameters

Parameter	Used by	Default	Description
`nan_fill_value`	`FILL_CONSTANT`	`None`	The constant that replaces NaN. Required when using `FILL_CONSTANT`.
`nan_knn_neighbors`	`KNN_IMPUTE`	`5`	Number of neighbours for KNN imputation.
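
To show what the neighbour count controls, here is a minimal standard-library sketch of K-nearest-neighbours imputation in which `k` plays the role of `nan_knn_neighbors`. The distance metric (a NaN-aware Euclidean distance over the features two samples share) and the averaging rule are assumptions chosen for illustration, and the function names are hypothetical; mllabiome's actual implementation may differ.

```python
import math

NAN = float("nan")


def is_nan(x):
    return isinstance(x, float) and math.isnan(x)


def nan_distance(a, b):
    """Euclidean distance over features observed in both rows (inf if none)."""
    shared = [(x, y) for x, y in zip(a, b) if not is_nan(x) and not is_nan(y)]
    if not shared:
        return math.inf
    # Rescale so that rows sharing only a few features are not favoured.
    scale = len(a) / len(shared)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in shared))


def knn_impute(rows, k=5):
    """Fill each NaN with the mean of that feature over the k nearest donor rows."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not is_nan(value):
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [
                (nan_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and not is_nan(other[j])
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


rows = [
    [1.0, 2.0, NAN],
    [1.0, 2.0, 4.0],
    [1.2, 2.1, 6.0],
    [9.0, 9.0, 50.0],
]

result = knn_impute(rows, k=2)
print(result[0])  # [1.0, 2.0, 5.0]
```

With `k=2` the missing value is the mean of the two closest samples (4.0 and 6.0), and the distant fourth sample is ignored. A larger `k` would pull it into the average, which is the trade-off this parameter governs.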