Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

MOTIVATION: Microarray data are used in a range of application areas in biology, but they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is strong motivation to estimate them as accurately as possible before applying these algorithms. While many imputation algorithms have been proposed, more robust techniques are needed so that further analysis of biological data can be undertaken accurately. This paper presents an innovative missing value imputation algorithm called collateral missing value estimation (CMVE), which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods.

RESULTS: The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested on estimating missing values in three separate non-time series (ovarian cancer based) datasets and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performs better not only for randomly introduced missing values but also for a real distribution of missing values. The results confirmed that CMVE consistently estimated missing values more accurately and robustly than the other methods for both time series and non-time series data, at the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm.

AVAILABILITY: The CMVE software is available upon request from the authors.
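
The evaluation protocol described under RESULTS (randomly introduce missing values at a given probability, impute them, then score the estimates with an NRMS error) can be illustrated with a minimal sketch. This is not the CMVE software: the row-mean imputer is only a stand-in for whichever method is under test, the synthetic matrix replaces the real datasets, the function names are hypothetical, and the normalization used here (RMS of the estimation error divided by the RMS of the true values at the masked positions) is an assumption, since the abstract does not spell out the exact formula.

```python
import numpy as np

def mask_at_random(data, rate, rng):
    """Set each entry to NaN independently with probability `rate`."""
    mask = rng.random(data.shape) < rate
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask

def row_mean_impute(masked):
    """Stand-in imputer (not CMVE): fill gaps with the observed mean of each row (gene)."""
    observed = ~np.isnan(masked)
    counts = observed.sum(axis=1)
    sums = np.nansum(masked, axis=1)
    # Fall back to the global observed mean for rows with no observed entries.
    global_mean = np.nansum(masked) / max(observed.sum(), 1)
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    filled = masked.copy()
    rows, cols = np.nonzero(~observed)
    filled[rows, cols] = row_means[rows]
    return filled

def nrms_error(true, estimated, mask):
    """Assumed NRMS definition: RMS(estimate - truth) / RMS(truth) over masked entries."""
    if not mask.any():
        raise ValueError("mask selects no entries")
    diff = estimated[mask] - true[mask]
    return np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(true[mask] ** 2))

# Synthetic genes-by-samples matrix used purely for illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 20))

for rate in (0.01, 0.05, 0.1, 0.2):
    masked, mask = mask_at_random(data, rate, rng)
    filled = row_mean_impute(masked)
    print(f"rate={rate:.2f}  NRMS={nrms_error(data, filled, mask):.3f}")
```

Swapping `row_mean_impute` for another imputer with the same input and output shape would allow several methods to be compared at each missing value probability under this assumed error measure.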
