A genetic algorithm for discovering small-disjunct rules in data mining

This paper addresses the well-known classification task of data mining, where the goal is to discover rules predicting the class of examples (records of a given dataset). In the context of data mining, small disjuncts are rules covering a small number of examples. Because each is induced from so few examples, these rules are usually error-prone, which contributes to a decrease in predictive accuracy. At first glance, this is not a serious problem, since the impact of any single small disjunct on predictive accuracy should be small. However, although each small disjunct covers few examples, the set of all small disjuncts can cover a large number of examples. This paper presents evidence that this is the case in several datasets. This paper also addresses the problem of small disjuncts by using a hybrid decision-tree/genetic-algorithm approach. In essence, examples belonging to large disjuncts are classified by rules produced by a decision-tree algorithm (C4.5), while examples belonging to small disjuncts are classified by rules produced by a genetic algorithm specifically designed for discovering small-disjunct rules. We present results comparing the predictive accuracy of this hybrid system with the predictive accuracy of three versions of C4.5 alone in eight public domain datasets. Overall, the results show that our hybrid system achieves better predictive accuracy than all three versions of C4.5 alone.
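The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Disjunct` class, the coverage `threshold`, and the `small_disjunct_classifier` callable (a stand-in for the genetic-algorithm rules) are all hypothetical names introduced here, and the abstract does not state what coverage counts as "small". The sketch shows two things: how much of the training data the small disjuncts cover in aggregate, and how an example is sent either to the decision-tree rule or to the small-disjunct classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Example = Dict[str, float]


@dataclass
class Disjunct:
    """One rule (for instance a decision-tree leaf) plus its training coverage."""
    condition: Callable[[Example], bool]  # True if the rule covers the example
    predicted_class: str                  # class predicted by the tree rule
    coverage: int                         # number of training examples covered


def is_small(disjunct: Disjunct, threshold: int) -> bool:
    # Assumption: a disjunct is "small" if it covers at most `threshold` examples.
    return disjunct.coverage <= threshold


def small_disjunct_share(disjuncts: List[Disjunct], threshold: int) -> float:
    """Fraction of covered training examples that fall into small disjuncts."""
    total = sum(d.coverage for d in disjuncts)
    small = sum(d.coverage for d in disjuncts if is_small(d, threshold))
    return small / total if total else 0.0


def classify(
    example: Example,
    disjuncts: List[Disjunct],
    threshold: int,
    small_disjunct_classifier: Callable[[Example, Disjunct], str],
) -> str:
    """Use the tree rule for large disjuncts, the stand-in classifier otherwise."""
    for d in disjuncts:
        if d.condition(example):
            if is_small(d, threshold):
                # Placeholder for rules evolved by the genetic algorithm.
                return small_disjunct_classifier(example, d)
            return d.predicted_class
    raise ValueError("no rule covers the example")
```

With one disjunct covering 90 examples and three covering 4, 3, and 3, a threshold of 5 marks the last three as small; individually they are negligible, but together they account for 10% of the covered examples, which is the effect the abstract describes.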
