
NaN handling

Multimodal data frequently contains missing values. A sample may be present in one modality but absent or incomplete in another. mllabiome provides per-modality NaN handling configured through ModalityConfig.nan_handling.

Available strategies

All strategies are members of mll.NaNHandlingStrategy:

| Strategy | Behaviour |
| --- | --- |
| DROP_SAMPLES | Remove any sample that contains at least one NaN in this modality. This is the default. |
| FILL_ZERO | Replace NaN with 0. |
| FILL_MEAN | Replace NaN with the column mean (numeric) or mode (categorical). |
| FILL_MEDIAN | Replace NaN with the column median (numeric) or mode (categorical). |
| FILL_MODE | Replace NaN with the most frequent value. |
| FILL_CONSTANT | Replace NaN with the value specified by nan_fill_value. |
| KNN_IMPUTE | K-nearest-neighbours imputation. The number of neighbours is controlled by nan_knn_neighbors (default 5). |
| FORWARD_FILL | Forward-fill (useful for time-series data). |
| BACKWARD_FILL | Backward-fill (useful for time-series data). |
| INTERPOLATE | Linear interpolation between existing values. |
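To make the difference between the fill strategies concrete, here is a minimal pure-Python sketch of what FILL_MEDIAN, FORWARD_FILL, BACKWARD_FILL and INTERPOLATE do to a single numeric column. The helper names (fill_median, forward_fill, backward_fill, interpolate_linear) are hypothetical and are not part of mllabiome; the library's actual implementation and its treatment of edge cases (for example, a leading NaN under forward-fill) are not described here and may differ.

```python
import math
from statistics import median

NAN = math.nan


def is_nan(x):
    """True for float NaN, False for every other value."""
    return isinstance(x, float) and math.isnan(x)


def fill_median(col):
    """Replace each NaN with the median of the observed values."""
    m = median(v for v in col if not is_nan(v))
    return [m if is_nan(v) else v for v in col]


def forward_fill(col):
    """Carry the last observed value forward; a leading NaN stays NaN."""
    out, last = [], NAN
    for v in col:
        if not is_nan(v):
            last = v
        out.append(last)
    return out


def backward_fill(col):
    """Carry the next observed value backward; a trailing NaN stays NaN."""
    return forward_fill(col[::-1])[::-1]


def interpolate_linear(col):
    """Fill NaN runs that lie between two observed values on a straight line."""
    out = list(col)
    known = [i for i, v in enumerate(col) if not is_nan(v)]
    for left, right in zip(known, known[1:]):
        step = (col[right] - col[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = col[left] + step * (i - left)
    return out


column = [1.0, NAN, NAN, 4.0, NAN, 10.0]

print(fill_median(column))         # [1.0, 4.0, 4.0, 4.0, 4.0, 10.0]
print(forward_fill(column))        # [1.0, 1.0, 1.0, 4.0, 4.0, 10.0]
print(backward_fill(column))       # [1.0, 4.0, 4.0, 4.0, 10.0, 10.0]
print(interpolate_linear(column))  # [1.0, 2.0, 3.0, 4.0, 7.0, 10.0]
```

The same input yields four different columns, which is why the choice is made per modality: the order-dependent strategies (forward-fill, backward-fill, interpolation) only make sense when the values have a meaningful sequence, such as a time series.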

Configuration

NaN handling is set on each ModalityConfig individually, allowing different strategies per modality:

    import mllabiome as mll

    # Impute missing values in this modality with the column median.
    modality_a = mll.ModalityConfig(
        name="modality_a",
        file="modality_a_data.tsv",
        sample_id_column="Sample",
        nan_handling=mll.NaNHandlingStrategy.FILL_MEDIAN,
        verbose_nan_handling=True,
    )

    # Drop any sample with at least one NaN in this modality (the default).
    modality_b = mll.ModalityConfig(
        name="modality_b",
        file="modality_b_data.tsv",
        sample_id_column="Sample",
        nan_handling=mll.NaNHandlingStrategy.DROP_SAMPLES,
        verbose_nan_handling=True,
    )

Verbose output

When verbose_nan_handling=True, a summary is printed during evaluation:

    NaN handling [modality_a]: fill_median
    Found 26 NaN values in 6 samples
    Imputed 26 values -> 0 NaN remaining
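The summary reports two different counts: individual NaN values and the number of samples that contain at least one of them. The following sketch shows how the two figures relate and what DROP_SAMPLES would remove. It is an illustration only: nan_summary and drop_samples are hypothetical helpers, the table is made-up example data, and none of this is taken from mllabiome's internals.

```python
import math


def is_nan(x):
    """True for float NaN, False for every other value."""
    return isinstance(x, float) and math.isnan(x)


def nan_summary(rows):
    """Return (total NaN values, number of samples with at least one NaN)."""
    per_sample = [sum(is_nan(v) for v in values) for values in rows.values()]
    return sum(per_sample), sum(1 for n in per_sample if n > 0)


def drop_samples(rows):
    """Keep only the samples that have no NaN in any feature."""
    return {
        sample_id: values
        for sample_id, values in rows.items()
        if not any(is_nan(v) for v in values)
    }


# Hypothetical modality: sample ID -> feature values.
table = {
    "S1": [0.2, 1.5, 3.0],
    "S2": [math.nan, 1.1, math.nan],
    "S3": [0.4, math.nan, 2.2],
    "S4": [0.9, 1.3, 2.8],
}

n_values, n_samples = nan_summary(table)
kept = drop_samples(table)

print(f"Found {n_values} NaN values in {n_samples} samples")  # 3 values, 2 samples
print(f"Kept {len(kept)} of {len(table)} samples")            # 2 of 4
```

A handful of NaN values spread thinly across many samples can make DROP_SAMPLES discard a large share of the data, so the verbose counts are worth checking before settling on a strategy.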

Strategy-specific parameters

| Parameter | Used by | Default | Description |
| --- | --- | --- | --- |
| nan_fill_value | FILL_CONSTANT | None | The constant value to fill. Required when using FILL_CONSTANT. |
| nan_knn_neighbors | KNN_IMPUTE | 5 | Number of neighbours for KNN imputation. |
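To show what the neighbour count changes in practice, here is a small, self-contained sketch of k-nearest-neighbours imputation. It is a generic illustration under stated assumptions, not mllabiome's implementation: knn_impute and distance are hypothetical names, the distance is a root-mean-square difference over the features observed in both rows, and the imputed value is an unweighted mean of the neighbours' values. The metric and weighting the library actually uses are not specified here.

```python
import math


def is_nan(x):
    """True for float NaN, False for every other value."""
    return isinstance(x, float) and math.isnan(x)


def distance(a, b):
    """Root-mean-square difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if not is_nan(x) and not is_nan(y)]
    if not shared:
        return math.inf  # no overlap, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, n_neighbors=5):
    """Fill each NaN with the mean of that feature over the nearest rows."""
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not is_nan(value):
                continue
            # Donors: other rows that have feature j observed and are comparable.
            donors = [
                (distance(row, other), other[j])
                for k, other in enumerate(rows)
                if k != i and not is_nan(other[j])
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:n_neighbors]
            if nearest:
                out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out


rows = [
    [1.0, 2.0, math.nan],  # sample with a missing third feature
    [1.0, 2.0, 10.0],      # identical on the observed features
    [2.0, 3.0, 20.0],      # close
    [9.0, 9.0, 90.0],      # far away
]

print(knn_impute(rows, n_neighbors=1)[0][2])  # 10.0
print(knn_impute(rows, n_neighbors=2)[0][2])  # 15.0
print(knn_impute(rows)[0][2])                 # 40.0 (only 3 donors exist)
```

With few neighbours the imputed value follows the most similar samples; as the count grows past the number of genuinely similar samples, distant ones such as the last row start pulling the estimate toward the overall mean.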