Generation of attribute value taxonomies from data for data-driven construction of accurate and compact classifiers

Attribute value taxonomies (AVTs) have been shown to be useful in constructing compact, robust, and comprehensible classifiers. However, in many application domains, human-designed AVTs are unavailable. We introduce AVT-learner, an algorithm for automated construction of attribute value taxonomies from data. AVT-learner uses hierarchical agglomerative clustering (HAC) to cluster attribute values based on the distribution of classes that co-occur with the values. We describe experiments on UCI data sets that compare the performance of AVT-NBL (an AVT-guided naive Bayes learner) with that of the standard naive Bayes learner (NBL) applied to the original data set. Our results show that the AVTs generated by AVT-learner are competitive with human-generated AVTs (in cases where such AVTs are available). AVT-NBL using AVTs generated by AVT-learner achieves classification accuracies that are comparable to or higher than those obtained by NBL, and the resulting classifiers are significantly more compact than those generated by NBL.
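
The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which dissimilarity measure AVT-learner uses, so the weighted Jensen-Shannon divergence below is an assumption, and the names `learn_taxonomy`, `js_divergence`, and `class_distribution` are hypothetical. The sketch repeatedly merges the two groups of attribute values whose co-occurring class distributions are most similar, and returns the resulting binary taxonomy as nested tuples.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def class_distribution(counts, classes):
    """Turn class counts for one cluster into a probability vector."""
    total = sum(counts.values())
    return [counts[c] / total for c in classes]


def js_divergence(p, q, wp, wq):
    """Weighted Jensen-Shannon divergence (assumed dissimilarity measure)."""
    m = [wp * a + wq * b for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return wp * kl(p, m) + wq * kl(q, m)


def learn_taxonomy(values, labels):
    """Agglomeratively cluster the values of one attribute.

    values[i] is the attribute value of example i, labels[i] its class.
    Returns a binary tree as nested tuples whose leaves are the values.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1

    # Each cluster (a leaf value or a nested tuple) maps to its class counts.
    clusters = {v: counts[v] for v in sorted(counts)}

    def cost(pair):
        a, b = pair
        na, nb = sum(clusters[a].values()), sum(clusters[b].values())
        return js_divergence(
            class_distribution(clusters[a], classes),
            class_distribution(clusters[b], classes),
            na / (na + nb),
            nb / (na + nb),
        )

    while len(clusters) > 1:
        # Merge the pair with the most similar class distributions.
        a, b = min(combinations(clusters, 2), key=cost)
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)

    (root,) = clusters
    return root


if __name__ == "__main__":
    colors = ["red", "red", "crimson", "crimson", "blue", "blue", "navy", "navy"]
    target = ["warm", "warm", "warm", "warm", "cool", "cool", "cool", "cool"]
    print(learn_taxonomy(colors, target))
```

In this toy data set, "red" and "crimson" always co-occur with one class and "blue" and "navy" with the other, so each pair is merged before the two groups are joined at the root. A taxonomy of this shape is what an AVT-guided learner such as AVT-NBL would then take as input alongside the data.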
