Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL

Because they are easy to comprehend, rules are preferable to non-linear kernel functions in the analysis of bio-medical data. In this paper, we describe two rule induction approaches, C4.5 and our PCL classifier, for discovering rules from both traditional clinical data and recent gene expression or proteomic profiling data. C4.5 is a widely used method, but it has two weaknesses that affect its accuracy: the single coverage constraint and the fragmentation problem. PCL is a new rule-based classifier that overcomes these two weaknesses of decision trees by using many significant rules. We present a thorough comparison to show that our PCL method is much more accurate than C4.5, and that it is also generally superior to Bagging and Boosting.
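
As a rough illustration of the single coverage constraint, the sketch below flattens a small hand-written decision tree into one rule per root-to-leaf path and checks that every sample satisfies exactly one of them. The feature names (`gene_a`, `gene_b`), thresholds, and class labels are hypothetical and chosen only for this example; the code is not C4.5 itself and does not implement PCL.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
# A rule is (human-readable condition, predicate, predicted class).
Rule = Tuple[str, Callable[[Sample], bool], str]

# Hypothetical toy tree: each root-to-leaf path is written out as one rule.
# Because the paths partition the feature space, the rules are mutually
# exclusive and jointly exhaustive.
TREE_RULES: List[Rule] = [
    (
        "gene_a <= 0.5 and gene_b <= 1.0",
        lambda s: s["gene_a"] <= 0.5 and s["gene_b"] <= 1.0,
        "normal",
    ),
    (
        "gene_a <= 0.5 and gene_b > 1.0",
        lambda s: s["gene_a"] <= 0.5 and s["gene_b"] > 1.0,
        "tumor",
    ),
    (
        "gene_a > 0.5",
        lambda s: s["gene_a"] > 0.5,
        "tumor",
    ),
]


def matching_rules(sample: Sample) -> List[Rule]:
    """Return every tree rule whose condition the sample satisfies."""
    return [rule for rule in TREE_RULES if rule[1](sample)]


def classify(sample: Sample) -> str:
    """Classify with the single rule that covers the sample."""
    matches = matching_rules(sample)
    # Single coverage: a decision tree never offers a second, corroborating rule.
    if len(matches) != 1:
        raise ValueError("tree rules must cover each sample exactly once")
    return matches[0][2]


if __name__ == "__main__":
    for sample in (
        {"gene_a": 0.2, "gene_b": 0.3},
        {"gene_a": 0.2, "gene_b": 2.0},
        {"gene_a": 0.9, "gene_b": 0.1},
    ):
        covering = matching_rules(sample)
        print(sample, "->", classify(sample), "| covering rules:", len(covering))
```

Each prediction here rests on a single rule, which is the limitation the abstract attributes to decision trees; a classifier that uses many significant rules would instead let several rules vote on the same sample.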
