Impute Gene Expression Missing Values via Biological Networks: Optimal Fusion of Data and Knowledge

Gene expression data often contain missing values that, if not handled properly, may mislead or invalidate downstream analyses. With the emergence of graph neural networks (GNNs), domain knowledge about gene regulation can be leveraged to guide missing-data imputation. We show in this paper, however, that naive application of GNNs to raw gene-expression data can actually lead to worse imputation. We analyse this problem by considering both the intrinsic properties of GNN message passing and potential inconsistency between the data and the prior knowledge. We propose two measures towards optimal integration of biological networks into gene-expression missing-data imputation: normalisation of the expression data and a weighting scheme for GNN message passing. Experiments on two different biological networks and gene expression datasets show that our method outperforms state-of-the-art generic imputation algorithms and alternative GNN models, consistently obtaining lower mean absolute error (MAE).
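
To make the ingredients named above concrete, the following is a minimal sketch, not the model proposed in this paper. It assumes per-gene z-score normalisation as the normalisation step, a single round of weighted neighbour averaging as a stand-in for GNN message passing, and MAE computed over held-out entries. All function names, the edge-list format, and the fallback value are hypothetical choices made for illustration.

```python
import math

def zscore_per_gene(matrix):
    """Standardise each gene (row) using only its observed values.

    Missing entries are None. Assumes every gene has at least one
    observed value (an illustrative simplification).
    """
    out = []
    for row in matrix:
        obs = [v for v in row if v is not None]
        mean = sum(obs) / len(obs)
        # Guard against constant genes: fall back to a unit scale.
        std = math.sqrt(sum((v - mean) ** 2 for v in obs) / len(obs)) or 1.0
        out.append([None if v is None else (v - mean) / std for v in row])
    return out

def impute_weighted(matrix, edges):
    """Fill each missing entry with a weighted mean of its neighbours.

    edges maps a gene index to a list of (neighbour_index, weight) pairs;
    a negative weight is used here to represent a repressive relation.
    Only neighbours observed in the same sample contribute.
    """
    filled = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for s, v in enumerate(row):
            if v is not None:
                continue
            num = den = 0.0
            for n, w in edges.get(g, []):
                nv = matrix[n][s]
                if nv is not None:
                    num += w * nv
                    den += abs(w)
            # With no observed neighbour, fall back to 0.0, which is the
            # gene mean once the data have been z-scored.
            filled[g][s] = num / den if den else 0.0
    return filled

def masked_mae(truth, imputed, mask):
    """Mean absolute error over the held-out (gene, sample) positions."""
    errs = [abs(truth[g][s] - imputed[g][s]) for g, s in mask]
    return sum(errs) / len(errs)
```

Normalising first puts all genes on a comparable scale, so that averaging over neighbours does not simply copy the magnitude of highly expressed genes into weakly expressed ones. Signed weights let the same averaging step respect edges whose regulatory direction is opposite to the observed correlation.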