Pin-pointing concept descriptions

This study investigates the task of obtaining accurate and comprehensible concept descriptions for a specific set of production instances. The suggested method, inspired by rule extraction and transductive learning, uses a highly accurate opaque model, called an oracle, to coach the construction of transparent decision list models. The decision list algorithms evaluated are JRip and four variants of Chipper, a technique specifically developed for concept description. On 40 real-world data sets from the drug discovery domain, employing an oracle coach to label the production data resulted in significantly more accurate and smaller models for almost all techniques. Furthermore, augmenting the normal training data with production data labeled by the oracle also led to significant increases in predictive performance, at the cost of a slight increase in model size. Of the techniques evaluated, normal Chipper, optimizing FOIL's information gain and allowing conjunctive rules, was clearly the best. The overall conclusion is that oracle coaching works very well for concept description.
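The oracle-coaching workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a k-nearest-neighbour vote stands in for the opaque oracle, a single-threshold rule stands in for a decision list (it is neither JRip nor Chipper), and the data and the names `knn_oracle`, `fit_stump`, and `stump_predict` are hypothetical.

```python
from collections import Counter

def knn_oracle(train, k=3):
    """Stand-in opaque model: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return predict

def fit_stump(data):
    """Stand-in transparent model: the rule 'if x <= t then a else b'
    with the fewest errors on data (first such rule found wins ties)."""
    thresholds = sorted({x for x, _ in data})
    labels = sorted({y for _, y in data})
    best = None
    for t in thresholds:
        for a in labels:
            for b in labels:
                errors = sum((a if x <= t else b) != y for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, t, a, b)
    return best[1:]  # (threshold, label_if_below_or_equal, label_if_above)

def stump_predict(stump, x):
    t, a, b = stump
    return a if x <= t else b

# Hypothetical toy data: labeled training instances, unlabeled production instances.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
production = [2.5, 3.5, 5.5, 6.5]

# Step 1: the oracle, trained on the labeled data, labels the production instances.
oracle = knn_oracle(train, k=3)
oracle_labels = [oracle(x) for x in production]

# Step 2: the transparent model is built from the oracle-labeled production data only.
coached = list(zip(production, oracle_labels))
coached_stump = fit_stump(coached)

# Variant: augment the normal training data with the oracle-labeled production data.
augmented_stump = fit_stump(train + coached)

print("oracle labels:", oracle_labels)
print("coached rule:", coached_stump)
print("augmented rule:", augmented_stump)
```

The transparent model here only has to describe the production instances as the oracle sees them, which is the transductive element of the approach. How well this carries over to real decision list learners is an empirical question that the toy example does not answer.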
