Classification Potential vs. Classification Accuracy: A Comprehensive Study of Evolutionary Algorithms with Biomedical Datasets

Biomedical datasets pose a unique challenge for machine learning and data mining techniques that aim to extract accurate, comprehensible and hidden knowledge from them. In this paper, we investigate the role that a biomedical dataset plays in the classification accuracy of an algorithm. To this end, we quantify the complexity of a biomedical dataset in terms of its missing values, imbalance ratio, noise and information gain. We performed our experiments using six well-known evolutionary rule learning algorithms (XCS, UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi) on 31 publicly available biomedical datasets. The results of our experiments and statistical analysis show that GAssist gives better classification results than the other schemes on the majority of the biomedical datasets, but it cannot be categorized as the best classifier. Moreover, our analysis reveals that the nature of a biomedical dataset, not the choice of evolutionary algorithm, plays a major role in determining the classification accuracy achieved on that dataset. We further show that noise is a dominant factor in determining the complexity of a dataset and that it is inversely proportional to the classification accuracy of all evaluated algorithms. Finally, we provide researchers with a meta-classification model that can be used to determine the classification potential of a dataset on the basis of its complexity measures.
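To make the complexity measures named above more concrete, the following is a minimal Python sketch of three of them: missing values, imbalance ratio and information gain. The definitions are assumptions chosen for illustration and may differ from those used in the paper: the missing-value measure is the fraction of empty attribute cells, the imbalance ratio is the majority-to-minority class count, and information gain is the class entropy minus the class entropy conditioned on a nominal attribute. Noise is not sketched here, since estimating it requires a separate filtering procedure. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def missing_fraction(rows):
    """Fraction of attribute cells that are None (assumed missing-value marker)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus class entropy conditioned on a nominal attribute.

    A missing value (None) is simply treated as one more attribute value.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy dataset (hypothetical): two nominal attributes, four instances.
rows = [("high", "yes"), ("high", None), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "sick", "healthy"]

first_attribute = [row[0] for row in rows]
print("missing fraction:", missing_fraction(rows))
print("imbalance ratio:", imbalance_ratio(labels))
print("information gain (attribute 1):", information_gain(first_attribute, labels))
```

Measures of this kind, computed per dataset, are the sort of inputs a meta-classification model could use to predict classification potential.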
