Abstract—This paper presents an analysis of the suitability of grammar-based genetic programming for the classification task in data mining. The evolutionary technique is compared with several classic algorithms for inducing decision trees and rules, using classification accuracy as the comparison criterion.

I. INTRODUCTION

Data mining (DM) consists of the extraction of useful, comprehensible and previously unknown knowledge from huge amounts of data stored in different formats [16]. Classification is one of the most studied problems by DM and machine learning (ML) researchers. It consists in predicting the value of a (categorical) attribute (the class) based on the values of other attributes (the predicting attributes). In the ML and DM fields, classification is usually approached as a supervised learning task. A search algorithm is used to induce a classifier from a set of correctly classified data instances, called the training set. Another set of correctly classified data instances, known as the test set, is used to measure the quality of the classifier obtained after the learning process. Different paradigms have been used to tackle classification: decision trees [10], inductive learning [8], instance-based learning [1] and, more recently, artificial neural networks [18] and evolutionary algorithms [4]. In this paper, we focus on decision tree, rule induction and evolutionary techniques.

Decision tree methods use greedy algorithms. These algorithms are generally fast, very effective, accurate and able to classify data completely. Most decision tree methods use recursive partitioning techniques that split the data space. However, the greedy nature of these algorithms can overlook multivariate relationships that cannot be found when attributes are considered separately. Rule induction algorithms usually employ a specific-to-general approach, in which rules are generalized (or specialized) until a satisfactory description of each class is obtained.
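The train/test evaluation protocol described above can be illustrated with a minimal sketch. This example is not from the paper: the trivial majority-class classifier, the dataset and all names are invented purely to show how classification accuracy on a held-out test set is computed.

```python
# Sketch (illustrative only): inducing a classifier from a training set
# and measuring its quality on a separate test set, as described above.
# The classifier is a trivial majority-class predictor; the data are invented.
from collections import Counter

def train_majority_classifier(training_set):
    """Induce the simplest possible classifier: always predict the
    most frequent class observed in the training set."""
    majority = Counter(cls for _, cls in training_set).most_common(1)[0][0]
    return lambda instance: majority

def accuracy(classifier, test_set):
    """Classification accuracy: fraction of correctly classified test instances."""
    hits = sum(1 for inst, cls in test_set if classifier(inst) == cls)
    return hits / len(test_set)

# Invented example data: each instance is (attribute values, class label).
training_set = [({"x": 1}, "pos"), ({"x": 2}, "pos"), ({"x": 3}, "neg")]
test_set = [({"x": 4}, "pos"), ({"x": 5}, "neg")]

clf = train_majority_classifier(training_set)
print(accuracy(clf, test_set))  # 0.5: one of the two test instances is "pos"
```

Any of the learning paradigms listed above (trees, rules, EAs) would simply replace the induction step; the accuracy criterion used for comparison stays the same.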
Finally, evolutionary algorithms (EAs) are probabilistic search algorithms inspired by certain points of the Darwinian theory of evolution. The flexibility and robustness of EAs allow the discovery of complex relationships that are usually missed by other algorithms.

In addition to the learning algorithm, another important issue that must be considered in classification is the representation formalism. Rules are one of the formalisms most often used to represent classifiers, and they are the one we have chosen for our work (a decision tree can be easily converted into a rule set [12]). The rule antecedent (IF part) contains a combination of conditions on the predicting attributes, and the rule consequent (THEN part) contains the predicted value for the class. Thus, a rule assigns a data instance to the class indicated by the consequent if the values of the predicting attributes satisfy the conditions expressed in the antecedent, and a classifier is represented as a rule set. The rules used in our work have the following format.
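The paper's exact rule grammar is not included in this excerpt, so the following sketch only illustrates the general IF-THEN form described above, assuming a conjunction of attribute-value conditions in the antecedent and a predicted class in the consequent. The attribute names and the sample rule are hypothetical.

```python
# Sketch (assumed form; the paper's own grammar is not shown here):
# an IF-THEN classification rule whose antecedent is a conjunction of
# conditions on predicting attributes and whose consequent is a class.

def make_rule(conditions, predicted_class):
    """conditions: list of (attribute, operator, value) triples,
    interpreted as a conjunction (logical AND)."""
    ops = {"==": lambda a, b: a == b,
           ">": lambda a, b: a > b,
           "<=": lambda a, b: a <= b}
    def fires(instance):
        return all(ops[op](instance[attr], val)
                   for attr, op, val in conditions)
    return fires, predicted_class

def classify(rule_set, instance, default_class):
    """Apply the rules in order; the first rule whose antecedent is
    satisfied assigns the instance to its consequent class."""
    for fires, cls in rule_set:
        if fires(instance):
            return cls
    return default_class

# Hypothetical rule: IF outlook == "sunny" AND humidity > 80 THEN class = "no"
rules = [make_rule([("outlook", "==", "sunny"), ("humidity", ">", 80)], "no")]
print(classify(rules, {"outlook": "sunny", "humidity": 90}, "yes"))  # no
print(classify(rules, {"outlook": "rain", "humidity": 90}, "yes"))   # yes
```

A decision tree maps onto this representation directly: each root-to-leaf path becomes one rule, which is why the conversion noted in [12] is straightforward.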
REFERENCES

[1] Catherine Blake et al. UCI Repository of Machine Learning Databases, 1998.
[2] Peter Clark et al. The CN2 induction algorithm. Machine Learning, 2004.
[3] Peter A. Whigham et al. Grammatical bias for evolutionary learning, 1996.
[4] Ian Witten et al. Data Mining, 2000.
[5] Saso Dzeroski et al. Inductive Logic Programming: Techniques and Applications, 1993.
[6] Alex A. Freitas. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Natural Computing Series, 2002.
[7] Thomas Bäck et al. An Overview of Evolutionary Computation. ECML, 1993.
[8] Robert C. Holte et al. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 1993.
[9] Ryszard S. Michalski et al. A theory and methodology of inductive learning, 1993.
[10] Peter Nordin et al. Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, 1998.
[11] Aiko M. Hormann et al. Programs for Machine Learning. Part I. Inf. Control, 1962.
[12] Alex Alves Freitas et al. Book Review: Data Mining Using Grammar-Based Genetic Programming and Applications. Genetic Programming and Evolvable Machines, 2001.
[13] David W. Aha et al. Instance-Based Learning Algorithms. Machine Learning, 1991.
[14] M. M. Kilgo et al. Statistics and Data Analysis: From Elementary to Intermediate, 2001.
[15] J. Ross Quinlan et al. Induction of Decision Trees. Machine Learning, 1986.
[16] Ryszard S. Michalski et al. On the Quasi-Minimal Solution of the General Covering Problem, 1969.
[17] Ian H. Witten et al. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. SGMD, 2002.
[18] Jacek M. Zurada et al. Introduction to Artificial Neural Systems, 1992.
[19] Alex Alves Freitas et al. Guest editorial: data mining and knowledge discovery with evolutionary algorithms. IEEE Trans. Evol. Comput., 2003.
[20] Sreerama K. Murthy et al. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 1998.