Cluster-based KNN missing value imputation for DNA microarray data

Gene expression data measured with microarrays commonly contain missing values. Left unaddressed, these can seriously degrade the reliability of any downstream analysis or medical application. Moreover, further study of the data may be impossible, because many analysis methods require a complete data set. This paper introduces a new method for imputing missing values in microarray data. The proposed algorithm, CKNN impute, extends k-nearest-neighbor imputation by incorporating local data clustering to improve quality and efficiency. Gene expression data are typically represented as a matrix whose rows correspond to genes and whose columns correspond to experiments. CKNN begins by forming a complete dataset, removing every row that has one or more missing values. Next, k clusters and their corresponding centroids are obtained by applying a clustering technique to the complete dataset. The genes considered similar to a target gene (one with missing values) are those belonging to the cluster whose centroid is closest to the target. The target gene is then imputed by applying the k-nearest-neighbor method to these similar genes. Empirical evaluation on published gene expression datasets suggests that the proposed technique performs better than both the classical k-nearest-neighbor method and its extension found in the literature.
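The steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not name the clustering technique, the distance measure, or how neighbor values are combined, so the sketch assumes k-means with a deterministic farthest-first initialization, Euclidean distance over the target gene's observed columns, and an unweighted mean of the neighbors. The names `cknn_impute`, `n_clusters` and `n_neighbors` are hypothetical. The abstract uses k for both the number of clusters and the number of neighbors; the sketch keeps the two as separate parameters.

```python
import numpy as np


def assign(data, centroids):
    """Index of the nearest centroid (Euclidean) for every row of data."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(data, n_clusters, n_iter=50):
    """Lloyd's k-means (assumed here; the abstract names no technique)."""
    # Farthest-first initialization: start from row 0, then repeatedly add
    # the row farthest from the centroids chosen so far (deterministic).
    chosen = [0]
    for _ in range(1, n_clusters):
        gaps = np.linalg.norm(
            data[:, None, :] - data[chosen][None, :, :], axis=2
        ).min(axis=1)
        chosen.append(int(gaps.argmax()))
    centroids = data[chosen].copy()
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for c in range(n_clusters):
            members = data[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, assign(data, centroids)


def cknn_impute(X, n_clusters=2, n_neighbors=3):
    """Impute NaNs in a genes-by-experiments matrix, cluster first, then KNN."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    is_complete = ~missing.any(axis=1)

    # Step 1: complete dataset = rows without any missing value.
    complete = X[is_complete]
    # Step 2: clusters and centroids from the complete dataset.
    centroids, labels = kmeans(complete, n_clusters)
    sizes = np.bincount(labels, minlength=n_clusters)

    imputed = X.copy()
    for i in np.flatnonzero(~is_complete):
        obs = ~missing[i]
        # Step 3: closest centroid, compared on observed columns only.
        d_cent = np.linalg.norm(centroids[:, obs] - X[i, obs], axis=1)
        d_cent[sizes == 0] = np.inf  # never select an empty cluster
        members = complete[labels == d_cent.argmin()]
        # Step 4: nearest neighbors within that cluster, then average them.
        d = np.linalg.norm(members[:, obs] - X[i, obs], axis=1)
        nearest = members[np.argsort(d)[:n_neighbors]]
        imputed[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return imputed


if __name__ == "__main__":
    # Toy matrix: two well-separated groups of genes plus two incomplete rows.
    toy = np.array([
        [1.0, 1.0, 1.0], [1.1, 0.9, 1.0], [0.9, 1.1, 1.0],
        [10.0, 10.0, 10.0], [10.1, 9.9, 10.0], [9.9, 10.1, 10.0],
        [1.0, np.nan, 1.0], [10.0, 10.0, np.nan],
    ])
    print(cknn_impute(toy, n_clusters=2, n_neighbors=3))
```

Restricting the neighbor search to one cluster means each target gene is compared against the centroids and the members of a single cluster, not against every complete gene. This is one plausible reading of the efficiency gain the abstract mentions. The sketch assumes the complete dataset has at least `n_clusters` rows.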
