POLITECNICO DI TORINO Repository ISTITUZIONALE

Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R

Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples by assigning them to one of a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear, and one must be able to reject those samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study because they suffer from the curse of dimensionality, which undermines the reliability of both traditional rejection models and more recent approaches such as one-class classifiers.

Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remainder of this paper). The main contribution of the proposed rules is their simplicity, which enables easy integration with available data analysis environments. Because tuning the parameters of a rejection model is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this process. This allows the final user to maximize the rejection accuracy with minimum manual intervention.

Conclusions: This paper shows how simple decision rules can support the use of complex machine learning algorithms in real experimental setups. The proposed approach is almost completely automated and is therefore a good candidate for integration into data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be available.
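The abstract does not spell out the decision rules themselves, so the following is only a minimal sketch of what a threshold-based rejection option can look like in general. It is written in Python for illustration rather than in R, and it is not the paper's implementation. The function name `classify_with_reject`, the two thresholds `min_confidence` and `min_margin`, and their default values are all hypothetical; in the approach described above, parameters of this kind would be tuned automatically by the evolutionary strategy rather than set by hand.

```python
from typing import Optional, Sequence


def classify_with_reject(
    probs: Sequence[float],
    labels: Sequence[str],
    min_confidence: float = 0.7,  # hypothetical threshold, not from the paper
    min_margin: float = 0.2,      # hypothetical threshold, not from the paper
) -> Optional[str]:
    """Return the predicted label, or None when the sample is rejected.

    probs holds one class-membership score per label, as produced by any
    multi-class classifier that outputs per-class probabilities.
    """
    if len(probs) != len(labels) or len(probs) < 2:
        raise ValueError("need one score per label and at least two classes")

    # Rank classes from the highest score to the lowest.
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    (best_p, best_label), (second_p, _) = ranked[0], ranked[1]

    # Rule 1: reject when even the best class is not confident enough.
    if best_p < min_confidence:
        return None
    # Rule 2: reject when the two best classes are too close to call.
    if best_p - second_p < min_margin:
        return None
    return best_label


# Usage with made-up scores for three placeholder classes.
classes = ["class_A", "class_B", "class_C"]
accepted = classify_with_reject([0.85, 0.10, 0.05], classes)
rejected = classify_with_reject([0.45, 0.40, 0.15], classes)
print(accepted, rejected)
```

A sample that resembles none of the trained classes tends to produce either a low top score or several similar scores, which is why the sketch checks both conditions. How well such rules behave on high-dimensional expression data depends on the classifier and on the chosen thresholds.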
