Evaluation of Missing Value Estimation for Microarray Data

Microarray gene expression data contains missing values (MVs). However, some methods for downstream analyses, including some predic- tion tools, require a complete expression data matrix. Current methods for estimating the MVs include sample mean and K-nearest neighbors (KNN). Whether the accuracy of estimation (imputation) methods depends on the actual gene expression has not been thoroughly investigated. Under this set- ting, we examine how the accuracy depends on the actual expression level and propose new methods that provide improvements in accuracy relative to the current methods in certain ranges of gene expression. In particular, we propose regression methods, namely multiple imputation via ordinary least squares (OLS) and missing value prediction using partial least squares (PLS). Mean estimation of MVs ignores the observed correlation structure of the genes and is highly inaccurate. Estimating MVs using KNN, a method which incorporates pairwise gene expression information, provides substan- tial improvement in accuracy on average. However, the accuracy of KNN across the wide range of observed gene expression is unlikely to be uniform and this is revealed by evaluating accuracy as a function of the expression level.

[1]  Danh V. Nguyen,et al.  Tumor classification by partial least squares using microarray gene expression data , 2002, Bioinform..

[2]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[3]  S. Wold,et al.  The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses , 1984 .

[4]  A. Höskuldsson PLS regression methods , 1988 .

[5]  D. Rubin INFERENCE AND MISSING DATA , 1975 .

[6]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[7]  Danh V. Nguyen,et al.  Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares , 2002 .

[8]  J. Sudbø,et al.  Gene-expression profiles in hereditary breast cancer. , 2001, The New England journal of medicine.

[9]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[10]  Danh V. Nguyen,et al.  Multi-class cancer classification via partial least squares with gene expression profiles , 2002, Bioinform..

[11]  Raymond J Carroll,et al.  DNA Microarray Experiments: Biological and Technological Aspects , 2002, Biometrics.

[12]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[13]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[14]  I. Helland ON THE STRUCTURE OF PARTIAL LEAST SQUARES REGRESSION , 1988 .