Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods

Dimensionality reduction can often improve the performance of the k-nearest neighbor (kNN) classifier on high-dimensional data sets such as microarrays. How the choice of dimensionality reduction method affects the predictive performance of kNN on microarray data is an open issue, and four common methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain (IG), are compared on eight microarray data sets. All four dimensionality reduction methods are observed to yield more accurate classifiers than using the raw attributes. Furthermore, both PCA and PLS reach their best accuracies with fewer components than the other two methods, while RP needs far more components than the others to outperform kNN on the non-reduced data set. No dimensionality reduction method can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks. The main conclusion of the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
