Multivariate Analysis of Data Sets with Missing Values: An Information Theory-Based Reliability Function

Abstract Missing values in complex biological data sets significantly impair our ability to detect and quantify interactions in biological systems and to infer relationships accurately. In this article, we show that information theory measures, such as mutual information and interaction information, can be applied directly to evaluate multivariable dependencies even when the data contain missing values. We propose a useful metaphor: variable dependencies are information channels between and among variables, and in this view missing data can be thought of as noise that reduces channel capacity in predictable ways. We extract the information available in the data despite the missing values and use the notion of channel capacity to assess the reliability of the result. This avoids the common practice, in the absence of prior knowledge, of random imputation or of eliminating incomplete samples entirely, which discards the information those samples can provide. We show how this reliability function can be implemented for pairs of variables, and generalize it to an arbitrary number of variables. Illustrations of the reliability functions for several cases are provided using simulated data.
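The core idea of the abstract can be sketched in code. The snippet below is only an illustrative sketch, not the paper's method: it estimates pairwise mutual information from the complete (jointly observed) samples and reports the fraction of usable samples as a crude stand-in for the channel-capacity-based reliability function the article develops. The function name `mutual_information` and the use of `None` to mark missing values are assumptions made for this example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information estimate (in bits) between two discrete
    variables, using only the samples where both values are observed.

    Returns (mi, completeness), where completeness is the fraction of
    samples that were jointly observed -- a naive proxy for how much the
    "noise" of missingness has reduced the effective channel capacity.
    """
    # Keep only jointly observed (complete) pairs.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0, 0.0

    # Empirical joint and marginal counts over the complete pairs.
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)

    # I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
    mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())

    completeness = n / len(xs)
    return mi, completeness

# Two perfectly dependent binary variables with one missing value each:
mi, completeness = mutual_information([0, 0, 1, 1, None, 0],
                                      [0, 0, 1, 1, 1, None])
```

On the four complete pairs the variables are identical, so the estimated MI is 1 bit, while the completeness of 4/6 signals that one third of the samples could not contribute. The paper's contribution, in contrast, is a principled reliability function derived from channel capacity rather than this simple sample-fraction heuristic.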
