Some data mining problems require predictive models to be not only accurate but also comprehensible. Comprehensibility enables human inspection and understanding of the model, making it possible to trace why individual predictions are made. Since most high-accuracy techniques produce opaque models, accuracy is, in practice, regularly sacrificed for comprehensibility. One frequently studied technique, often able to reduce this accuracy vs. comprehensibility tradeoff, is rule extraction, i.e., the activity where another, transparent, model is generated from the opaque one. In this paper, it is argued that techniques producing transparent models, either directly from the dataset or from an opaque model, could benefit from using an oracle guide. In the experiments, genetic programming is used to evolve decision trees, and a neural network ensemble is used as the oracle guide. More specifically, the datasets used by the genetic programming when evolving the decision trees consist of several different combinations of the original training data and "oracle data", i.e., training or test data instances together with the corresponding predictions from the oracle. In total, seven different ways of combining regular training data with oracle data were evaluated, and the results, obtained on 26 UCI datasets, clearly show that the use of an oracle guide improved performance. In fact, trees evolved using training data only had the worst test set accuracy of all setups evaluated. Furthermore, statistical tests show that two setups, both using the oracle guide, produced significantly more accurate trees compared to the setup using training data only.
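To make the oracle-guided setup concrete, the following is a minimal sketch of one of the evaluated data combinations (original training data plus test instances labelled by the oracle). It assumes scikit-learn, uses a bagged MLP ensemble as a stand-in for the neural network ensemble oracle, and a CART decision tree as a stand-in for the GP-evolved tree (the paper itself uses genetic programming, G-REX, not CART); the dataset and parameters are illustrative only.

```python
# Sketch: oracle-guided induction of a transparent model (illustrative, not the paper's G-REX setup).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train the opaque oracle: an ensemble of neural networks.
oracle = BaggingClassifier(
    estimator=MLPClassifier(max_iter=1000, random_state=0),
    n_estimators=10,
    random_state=0,
).fit(X_train, y_train)

# 2. Build "oracle data": test instances paired with the oracle's
#    predictions (the true test labels are never used for training).
oracle_labels = oracle.predict(X_test)

# 3. One possible setup: original training data + oracle data.
X_combined = np.vstack([X_train, X_test])
y_combined = np.concatenate([y_train, oracle_labels])

# 4. Induce the transparent model on the combined data.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_combined, y_combined)

print("Test accuracy of oracle-guided tree:", tree.score(X_test, y_test))
```

The point of the combination in step 3 is that the transparent model is fitted not only to the original labels but also to the oracle's view of the very instances it will later be asked to predict, which is what the paper refers to as using an oracle guide.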