An enhanced approach on handling missing values using bagging k-NN imputation

High-dimensional data sets have attracted great interest from researchers in the database community over the past decades. Today's businesses capture enormous volumes of data, including digital documents, web pages, customer databases, hyper-spectral imagery, social networks, gene arrays, proteomics data, neurobiological signals, high-dimensional dynamical systems, sensor networks, financial transactions, and traffic statistics, thereby generating massive high-dimensional data sets. DNA microarrays make it possible to measure the expression levels of thousands of genes during a biological process. The difficulty with microarrays is that expression is measured for thousands of genes (features) from only tens to hundreds of samples. Microarray data often contain missing values that may affect subsequent analysis. In this paper, a novel imputation approach that combines k-NN with bagging is proposed to handle missing values. The experimental results show that the proposed method outperforms other methods in terms of the distance and density of clusters, and that bagging improves the performance of traditional k-NN imputation.
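
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of how k-NN imputation might be combined with bagging, not the authors' implementation. It assumes that rows are genes and columns are samples, that each bag is a bootstrap sample of donor rows, that distance is the root-mean-square difference over co-observed columns, and that the per-bag estimates are simply averaged. The function names and the parameters `k`, `n_bags`, and `seed` are hypothetical.

```python
import numpy as np


def knn_estimates(X, donor_rows, k):
    """Estimate each missing entry from its k nearest donor rows.

    Returns an array that is NaN everywhere except at missing positions
    for which at least one donor value was available.
    """
    missing = np.isnan(X)
    est = np.full(X.shape, np.nan)
    for i in np.flatnonzero(missing.any(axis=1)):
        ranked = []
        for j in donor_rows:
            # Compare rows only on columns observed in both (assumption).
            shared = ~missing[i] & ~missing[j]
            if j == i or not shared.any():
                continue
            rms = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            ranked.append((rms, int(j)))
        ranked.sort()  # nearest donors first
        for c in np.flatnonzero(missing[i]):
            # Take the k nearest donors that actually observe column c.
            vals = [X[j, c] for _, j in ranked if not missing[j, c]][:k]
            if vals:
                est[i, c] = np.mean(vals)
    return est


def bagged_knn_impute(X, k=3, n_bags=25, seed=0):
    """Average k-NN estimates over bootstrap samples of donor rows."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    total = np.zeros(X.shape)
    hits = np.zeros(X.shape)
    for _ in range(n_bags):
        # Bootstrap: draw n donor rows with replacement.
        donors = rng.integers(0, n, size=n)
        est = knn_estimates(X, donors, k)
        ok = ~np.isnan(est)
        total[ok] += est[ok]
        hits[ok] += 1
    out = X.copy()
    filled = hits > 0
    out[filled] = total[filled] / hits[filled]
    return out
```

Calling `bagged_knn_impute(X)` on a matrix with `np.nan` marking missing entries returns a copy in which observed values are unchanged. An entry stays `np.nan` only if no bag supplied a donor for it. Averaging over bags is one plausible way to reduce the variance of a single k-NN estimate; other aggregation rules would fit the same description.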
