Mapping microarray gene expression data into dissimilarity spaces for tumor classification

This paper presents a two-stage prediction model for microarray gene expression data. First, ReliefF is used to select a subset of top-ranked genes. Second, the samples represented by this subset of genes are mapped into a dissimilarity space. The resulting classifier is able to separate the classes more easily than a feature-based model. Results show that the dissimilarity-based classifiers outperform the feature-based models.

Microarray gene expression data sets usually contain a large number of genes but a small number of samples. In this article, we present a two-stage classification model that combines feature selection with the dissimilarity-based representation paradigm. In the preprocessing stage, the ReliefF algorithm is used to select a subset of top-ranked genes; in the learning/classification stage, the samples represented by the previously selected genes are mapped into a dissimilarity space, which is then used to construct a classifier capable of separating the classes more easily than a feature-based model. The ultimate aim of this paper is not to find the best subset of genes, but to analyze the performance of the dissimilarity-based models by means of a comprehensive collection of experiments on the classification of microarray gene expression data. To this end, we compare the classification results of an artificial neural network, a support vector machine and Fisher's linear discriminant classifier built on the feature (gene) space with those built on the dissimilarity space when varying the number of genes selected by ReliefF, using eight different microarray databases. The results show that the dissimilarity-based classifiers systematically outperform the feature-based models. In addition, classification through the proposed representation appears to be more robust (i.e. less sensitive to the number of genes) than classification with the conventional feature-based representation.
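The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the simplified two-class ReliefF variant, the Euclidean dissimilarity measure, the use of all training samples as prototypes, the ridge-regularized Fisher discriminant, the synthetic data and all function names (`relieff_scores`, `to_dissimilarity_space`, `fit_fisher`, `predict_fisher`) are assumptions made only for illustration.

```python
import numpy as np

def relieff_scores(X, y, k=3):
    """Simplified two-class ReliefF weights (higher = more relevant gene).
    Assumes every class has at least two samples."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                        # guard against constant genes
    Xs = (X - X.min(axis=0)) / span              # scale each gene to [0, 1]
    w = np.zeros(d)
    for i in range(n):
        diff = np.abs(Xs - Xs[i])                # per-gene differences to sample i
        dist = diff.sum(axis=1)                  # Manhattan distance to sample i
        same = np.where(y == y[i])[0]
        same = same[same != i]                   # a sample is not its own neighbour
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])[:k]]      # nearest same-class samples
        misses = other[np.argsort(dist[other])[:k]]  # nearest other-class samples
        w += diff[misses].mean(axis=0) - diff[hits].mean(axis=0)
    return w / n

def to_dissimilarity_space(X, prototypes):
    """Represent each sample by its Euclidean distances to the prototypes."""
    diff = X[:, None, :] - prototypes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def fit_fisher(D, y, reg=1e-2):
    """Fisher's linear discriminant with a ridge term (an assumption here,
    added because the within-class scatter is singular for small samples)."""
    D0, D1 = D[y == 0], D[y == 1]
    m0, m1 = D0.mean(axis=0), D1.mean(axis=0)
    Sw = (D0 - m0).T @ (D0 - m0) + (D1 - m1).T @ (D1 - m1)
    Sw = Sw + reg * np.trace(Sw) / Sw.shape[0] * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)
    b = -0.5 * w @ (m0 + m1)                     # threshold midway between the means
    return w, b

def predict_fisher(D, w, b):
    return (D @ w + b > 0).astype(int)

# Synthetic stand-in for a microarray set: few samples, many genes,
# only the first 5 genes carry class information (hypothetical data).
rng = np.random.default_rng(0)
n, d, informative = 40, 200, 5
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[y == 1, :informative] += 4.0
perm = rng.permutation(n)
train, test = perm[:30], perm[30:]

# Stage 1: rank genes with ReliefF on the training samples only.
scores = relieff_scores(X[train], y[train], k=3)
top = np.argsort(scores)[::-1][:10]

# Stage 2: map samples to distances from the training prototypes, then classify.
D_train = to_dissimilarity_space(X[train][:, top], X[train][:, top])
D_test = to_dissimilarity_space(X[test][:, top], X[train][:, top])
w, b = fit_fisher(D_train, y[train])
acc = np.mean(predict_fisher(D_test, w, b) == y[test])
print("selected genes:", sorted(top.tolist()))
print("test accuracy:", acc)
```

In this sketch the dimensionality of the dissimilarity space equals the number of prototypes (here 30), regardless of how many genes are selected. Gene ranking is done on the training split only, so the held-out samples do not influence the selection.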
