Missing value imputation for gene expression data: computational techniques to recover missing data from available information

Microarray gene expression data often contain missing values, which arise for a variety of experimental reasons. Because missing data points can adversely affect downstream analysis, many algorithms have been proposed to impute them. In this survey, we provide a comprehensive review of existing missing value imputation algorithms, focusing on their underlying algorithmic techniques, on how they exploit local or global information within the data, and on how they use domain knowledge during imputation. We also describe how imputation results can be validated and how the performance of different imputation algorithms can be assessed, and we discuss some possible directions for future research. We hope this review gives readers a good understanding of current developments in the field and inspires the next generation of imputation algorithms.
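
To make the ideas of local-information imputation and validation concrete, the following is a minimal sketch rather than any specific algorithm from the survey. It assumes a genes-by-samples matrix with missing entries stored as NaN, fills each gap from the k most similar complete genes using inverse-distance weights, and scores the result with a normalized root mean squared error on entries that were hidden deliberately. The function names `knn_impute` and `nrmse`, the parameter `k`, and the weighting scheme are illustrative assumptions.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in a genes x samples matrix from the k nearest complete genes.

    Illustrative sketch: neighbours are restricted to genes with no missing
    entries, and similarity is measured only on the target gene's observed columns.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        cand = np.where(complete)[0]
        if cand.size == 0 or miss.all():
            continue  # nothing to learn from; leave the gaps as NaN
        # Root mean squared difference over the columns observed in gene i.
        diff = X[cand][:, ~miss] - X[i, ~miss]
        d = np.sqrt((diff ** 2).mean(axis=1))
        order = np.argsort(d)[:k]
        nearest = cand[order]
        w = 1.0 / (d[order] + 1e-8)  # inverse-distance weights (assumed scheme)
        filled[i, miss] = (w[:, None] * X[nearest][:, miss]).sum(axis=0) / w.sum()
    return filled

def nrmse(truth, imputed, mask):
    """RMSE over the masked entries, normalised by the std of their true values."""
    err = truth[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])
```

A typical validation run under these assumptions starts from a complete matrix, hides a random subset of entries by setting them to NaN, imputes, and then calls `nrmse` with the boolean mask of hidden entries. A lower score means the imputed values are closer to the held-out truth.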
