A Hierarchical Ensemble of ECOC for cancer classification based on multi-class microarray data

The difficulty of cancer classification with multi-class microarray datasets lies in the small number of samples available for each class. To address this problem, we propose a hierarchical ensemble strategy named Hierarchical Ensemble of Error-Correcting Output Codes (HE-ECOC). In this strategy, different feature subsets extracted from a dataset are used as inputs to three data-dependent ECOC algorithms, producing different ECOC coding matrices. The mutual diversity among these coding matrices is then calculated under two schemes: maximizing local diversity (MLD) and maximizing global diversity (MGD). Both schemes can select diverse coding matrices generated by the same or different ECOC algorithm(s), and an average fusion scheme combines the outputs of the base learners. In the experiments, both the MLD-based and MGD-based HE-ECOC strategies perform stably and outperform individual ECOC algorithms. Compared with some other ensemble systems, HE-ECOC yields a more robust ensemble and achieves better performance in most cases. In short, HE-ECOC is a promising solution for the multi-class problem. The MATLAB code is available upon request.
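The pipeline described above (score the diversity of candidate coding matrices, keep a diverse subset, then average the base learners' outputs) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: the abstract does not define the MLD or MGD measures, so the sketch substitutes a plain entry-wise Hamming disagreement and a greedy selection, and it assumes all coding matrices share the same shape. The function names `hamming_diversity`, `select_diverse`, and `average_fusion` are hypothetical.

```python
import numpy as np


def hamming_diversity(a, b):
    """Fraction of entries that differ between two equally shaped coding matrices.

    Assumption: a stand-in diversity measure, not the MLD/MGD definition.
    """
    return float(np.mean(a != b))


def select_diverse(matrices, k):
    """Greedily pick indices of k matrices with high mutual diversity.

    Starts from the most diverse pair, then repeatedly adds the matrix whose
    summed diversity to the already chosen ones is largest. Requires k >= 2.
    """
    n = len(matrices)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = hamming_diversity(matrices[i], matrices[j])
    i, j = np.unravel_index(np.argmax(d), d.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        rest = [m for m in range(n) if m not in chosen]
        chosen.append(max(rest, key=lambda m: d[m, chosen].sum()))
    return chosen


def average_fusion(score_list):
    """Average per-class score arrays (samples x classes); return class indices."""
    mean_scores = np.mean(np.stack(score_list), axis=0)
    return mean_scores.argmax(axis=1)


# Toy coding matrices: 3 classes (rows) x 2 binary problems (columns).
m0 = np.array([[1, -1], [-1, 1], [1, 1]])
m1 = np.array([[1, -1], [-1, 1], [1, -1]])   # differs from m0 in one entry
m2 = np.array([[-1, 1], [1, -1], [-1, -1]])  # differs from m0 in every entry
chosen = select_diverse([m0, m1, m2], k=2)

# Toy per-class scores from two ensemble members for two samples.
scores_a = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
scores_b = np.array([[0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
predicted = average_fusion([scores_a, scores_b])
```

In this toy run the selection keeps `m0` and `m2`, the pair that disagrees in every entry. A real system would also have to handle coding matrices with different numbers of columns, which this simplified measure does not.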
