Taxonomies by the numbers: building high-performance taxonomies

In this paper, we describe a system for constructing taxonomies that yield high accuracy with automated categorization systems, even on Web and intranet documents. In particular, we describe how measuring five key features of the system can be used to predict when categories are sufficiently well defined to yield high-accuracy categorization. We also describe the use of this system to construct a large (8800-category) general-purpose taxonomy and categorization system.
