Combination of KNN-Based Feature Selection and KNN-Based Missing-Value Imputation of Microarray Data

Microarrays are a useful biological resource for studying living organisms at the molecular level. Microarray data sets usually contain only a few samples but have high dimensionality and many missing values, which makes downstream analysis less efficient. This paper proposes a methodology for imputing missing values in microarray data. The proposed methodology combines KNN-based feature selection with KNN-based imputation (KNNFS impute). KNNFS impute comprises two main steps: feature selection and estimation of new values. The proposed method is compared with traditional KNN imputation and the row average method for estimating missing values on three microarray data sets: lung tumor, colon cancer, and ALL-AML leukemia. Estimation quality is measured by the normalized root mean squared error (NRMSE), with the lowest NRMSE indicating the best estimate. The results show that the proposed method achieves a smaller NRMSE than the compared methods on all three data sets.
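To make the comparison concrete, the two baseline methods named above (row average and traditional KNN imputation) and the NRMSE score can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: missing entries are represented as `None`, rows are taken to be genes, neighbors are ranked by a Euclidean-style distance over the columns observed in both rows, and NRMSE is taken to be the root mean squared error divided by the standard deviation of the true values. The function names and the parameter `k` are hypothetical. The KNN-based feature selection step of KNNFS impute is not sketched, because the abstract does not specify its selection criterion.

```python
import math


def row_average_impute(matrix):
    """Replace each missing entry (None) with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def _distance(a, b):
    """Root mean squared difference over columns observed in both rows (None if no overlap)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def knn_impute(matrix, k=2):
    """Fill each missing entry with the mean of that column over the k nearest rows.

    Only rows that have the target column observed are eligible neighbors.
    An entry with no eligible neighbor is left as None.
    """
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                d = _distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


def nrmse(estimated, actual):
    """Assumed definition: sqrt(mean squared error / variance of the true values)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    variance = sum((a - mean_actual) ** 2 for a in actual) / n
    mse = sum((e - a) ** 2 for e, a in zip(estimated, actual)) / n
    return math.sqrt(mse / variance)


# Toy expression matrix (made-up numbers): rows are genes, columns are samples.
toy = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [10.0, 20.0, 30.0],
]
knn_filled = knn_impute(toy, k=1)
avg_filled = row_average_impute(toy)
```

In an evaluation of the kind described above, known entries would be masked artificially, imputed by each method, and then scored with `nrmse` against the held-out true values; a lower score indicates a better estimate.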
