Classification of microarray cancer data using ensemble approach

An ensemble of classifiers combines the predictions of multiple component classifiers to improve prediction performance. In this paper, we conduct an experimental comparison of J48, NB, and IBk on nine microarray cancer datasets and also analyze their performance with Bagging, Boosting, and Stacked Generalization (Stacking). The experimental results show that all ensemble methods outperform the individual classification methods. We then present a method, referred to as SD-EnClass, for combining classifiers from different classification families into an ensemble, based on a simple estimation of each classifier's performance on each class. The experimental results show that the proposed model improves classification accuracy compared with simply selecting the best classifier in the combination. In the second stage, we combine the results of our proposed method with the results of Boosting, Bagging, and Stacking using the proposed combining method, obtaining results that are significantly better than those of Boosting, Bagging, or Stacking alone.
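The abstract does not spell out how SD-EnClass estimates or uses per-class performance, so the following is only a minimal sketch of the general idea of class-performance-weighted voting, not the authors' algorithm. It assumes that each classifier's per-class precision on held-out validation data serves as its vote weight; the classifier names, labels, and validation data are hypothetical.

```python
from collections import defaultdict


def per_class_precision(y_true, y_pred):
    """For each predicted label, the fraction of those predictions that were correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[pred] += 1
        if truth == pred:
            correct[pred] += 1
    return {label: correct[label] / total[label] for label in total}


def combine(predictions, class_scores):
    """Weighted vote: each classifier backs its predicted label with its
    estimated precision on that label (0.0 if it never predicted it)."""
    votes = defaultdict(float)
    for name, label in predictions.items():
        votes[label] += class_scores[name].get(label, 0.0)
    # Iterate labels in sorted order so ties are resolved deterministically.
    return max(sorted(votes), key=lambda label: votes[label])


# Hypothetical validation labels and per-classifier validation predictions.
y_val = ["tumor", "tumor", "normal", "normal", "tumor", "normal"]
val_preds = {
    "tree": ["tumor", "tumor", "tumor", "normal", "tumor", "tumor"],
    "bayes": ["tumor", "normal", "normal", "normal", "normal", "normal"],
    "knn": ["tumor", "tumor", "normal", "tumor", "normal", "normal"],
}
class_scores = {name: per_class_precision(y_val, preds)
                for name, preds in val_preds.items()}

# Predictions of the three classifiers for one new (hypothetical) sample.
new_sample = {"tree": "tumor", "bayes": "normal", "knn": "normal"}
combined_label = combine(new_sample, class_scores)
print(combined_label)
```

In this sketch a classifier that is reliable on one class but weak on another contributes strongly only when it predicts the class it handles well, which is one simple way to combine classifiers from different families.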
