Instance-based concept learning from multiclass DNA microarray data

BackgroundVarious statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.Figure 2Prediction errors on the ALL data set. The total number of misclassified cases in all ten folds are: 247 by distance-weighted k-NN, 257 by 1-NN, 248 by 3-NN, 248 by 5-NN, 250 by SVM, 348 by DT, and 333 by MLP.ResultsWe investigated the performance of instance-based classifiers, including a NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer is the considered instance to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.ConclusionGiven its highly intuitive underlying principles – simplicity, ease-of-use, and robustness – the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.

[1]  J. Downing,et al.  Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. , 2002, Cancer cell.

[2]  Thomas A. Darden,et al.  Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method , 2001, Bioinform..

[3]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[4]  Sayan Mukherjee,et al.  Molecular classification of multiple tumor types , 2001, ISMB.

[5]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[6]  Yoshua Bengio,et al.  Inference for the Generalization Error , 1999, Machine Learning.

[7]  Nello Cristianini,et al.  Large Margin DAGs for Multiclass Classification , 1999, NIPS.

[8]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[9]  Christian A. Rees,et al.  Systematic variation in gene expression patterns in human cancer cell lines , 2000, Nature Genetics.

[10]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[11]  Annette M. Molinaro,et al.  Loss-based estimation with cross-validation: applications to microarray data analysis , 2003, SKDD.

[12]  C. Weinberg,et al.  Gene Selection and Sample Classification Using a Genetic Algorithm and k-Nearest Neighbor Method , 2003 .

[13]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Chen-An Tsai,et al.  Multi-class clustering and prediction in the analysis of microarray data. , 2005, Mathematical biosciences.

[15]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[16]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[17]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[18]  Richard Simon,et al.  Supervised analysis when the number of candidate features (p) greatly exceeds the number of cases (n) , 2003, SKDD.

[19]  Eibe Frank,et al.  Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms , 2004, PAKDD.

[20]  Musa H. Asyali,et al.  Reliability analysis of microarray data using fuzzy c-means and normal mixture modeling based classification methods , 2005, Bioinform..

[21]  J. Dagorn,et al.  Cloning and expression of the mRNA of human galectin-4, an S-type lectin down-regulated in colorectal cancer. , 1997, European journal of biochemistry.

[22]  Nir Friedman,et al.  Tissue classification with gene expression profiles , 2000, RECOMB '00.

[23]  S. Dudoit,et al.  Introduction to Classification in Microarray Experiments , 2003 .

[24]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Lawrence Carin,et al.  Joint Classifier and Feature Optimization for Comprehensive Cancer Diagnosis Using Gene Expression Data , 2004, J. Comput. Biol..

[26]  Richard M. Simon,et al.  A Paradigm for Class Prediction Using Gene Expression Profiles , 2003, J. Comput. Biol..

[27]  M. Xiong,et al.  Recursive partitioning for tumor classification with gene expression microarray data , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Richard Baumgartner,et al.  Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions , 2003, Bioinform..

[29]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[30]  Sayan Mukherjee,et al.  Classifying Microarray Data Using Support Vector Machines , 2003 .

[31]  Jill P. Mesirov,et al.  Class prediction and discovery using gene expression data , 2000, RECOMB '00.

[32]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[33]  D. Botstein,et al.  A gene expression database for the molecular pharmacology of cancer , 2000, Nature Genetics.

[34]  Eivind Hovig,et al.  Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data , 2003, BMC Bioinformatics.

[35]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[36]  Werner Dubitzky,et al.  A Practical Approach to Microarray Data Analysis , 2003, Springer US.