The Effect of Topological Structure on Hierarchical Text Categorization

We consider the problem of assigning documents to categories in a hierarchically organized taxonomy, and the effect of modifying the topology of that hierarchy. Given a training corpus of documents already placed in categories, vocabulary is extracted. This vocabulary, consisting of words that appear with high relative frequency within a given category, characterizes each subject area and is associated with the corresponding node in the hierarchy. Each node's vocabulary is filtered, and its words are assigned weights with respect to the specific category. Test documents are scanned for this vocabulary, and categories are ranked for each document based on the presence of terms from it. Documents are assigned to categories based on these rankings, and precision and recall are measured. We present an algorithm for associating words with individual categories within the hierarchy and demonstrate that precision and recall can be significantly improved by taking the topology of the hierarchy into account when solving the categorization problem. We also show that these results can be improved further by intelligently selecting intermediate categories in the hierarchy. Solving the problem iteratively, moving downward from the root of the taxonomy to the leaf nodes, we improve precision from 82% to 89% and recall from 82% to 87% on the much-studied Reuters-21578 corpus, with 135 categories organized in a three-level hierarchy.

1 Introduction and Background

The proliferation of online information, driven by the explosive growth in use of the Internet, has created a need for text retrieval systems that help users access this information effectively, efficiently, and in a timely manner. Today's search engines have had difficulty keeping pace with the increasing amount of information that continuously needs to be indexed and searched.
Categorization of the original text is a means by which the information can be arranged and organized to facilitate the retrieval task. Natural language processing systems can then query against these pre-specified categories, yielding retrieval results that are more acceptable and beneficial to the user. The document categorization problem is one of assigning newly arriving documents to categories within a given hierarchy of categories. In general, a lower-level category may be part of more than one higher-level category. Moreover, a document may belong to more than one low-level category. While the techniques described here can be applied to this more general problem, the experiments we have conducted to date have been carried out on a corpus where each document is a member of a single category and the categories form a tree rather than a more general directed acyclic graph. We limited the investigation to this more specific problem in order to focus on the effect of making use of the hierarchy, specifically on changes
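
As a rough illustration of the restricted setting described above (one category per document, categories forming a tree), the following Python sketch walks a document from the root down to a leaf, choosing at each level the child whose vocabulary scores highest. It is only a sketch under stated assumptions, not the algorithm evaluated in this paper: the toy taxonomy `TREE`, the training snippets `TRAIN`, the ratio-based weighting, and the `min_ratio` filter threshold are all hypothetical stand-ins, since the exact filtering and weighting scheme is not given in this section.

```python
from collections import Counter

# Hypothetical toy taxonomy (parent -> children); not the Reuters hierarchy.
TREE = {
    "root": ["commodities", "finance"],
    "commodities": ["wheat", "gold"],
    "finance": ["interest", "forex"],
}

# Hypothetical training documents, each labeled with exactly one leaf category.
TRAIN = [
    ("wheat", "wheat harvest grain export tonnes"),
    ("wheat", "grain wheat crop tonnes"),
    ("gold", "gold mine ounces ore"),
    ("gold", "gold ounces bullion mine"),
    ("interest", "bank rate interest cut"),
    ("interest", "interest rate rise bank"),
    ("forex", "dollar yen currency exchange"),
    ("forex", "currency dollar exchange intervention"),
]


def leaves_under(node):
    """Return the leaf categories in the subtree rooted at node."""
    children = TREE.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def node_counts(node, train):
    """Count words over all training documents filed under node's subtree."""
    leaves = set(leaves_under(node))
    counts = Counter()
    for label, text in train:
        if label in leaves:
            counts.update(text.split())
    return counts


def train_weights(train, min_ratio=1.5):
    """Assumed weighting: a word's relative frequency within a child node
    divided by its relative frequency pooled over that node and its siblings.
    Words whose ratio falls below min_ratio are filtered out."""
    weights = {}
    for children in TREE.values():
        counts = {c: node_counts(c, train) for c in children}
        totals = {c: sum(counts[c].values()) for c in children}
        pooled = Counter()
        for c in children:
            pooled.update(counts[c])
        pooled_total = sum(pooled.values())
        for c in children:
            kept = {}
            for word, n in counts[c].items():
                ratio = (n / totals[c]) / (pooled[word] / pooled_total)
                if ratio >= min_ratio:
                    kept[word] = ratio
            weights[c] = kept
    return weights


def classify(text, weights, start="root"):
    """Descend from the root, picking the best-scoring child at each level.
    Ties go to the child listed first in TREE."""
    tokens = text.split()
    node = start
    while node in TREE:
        scores = {c: sum(weights[c].get(t, 0.0) for t in tokens)
                  for c in TREE[node]}
        node = max(scores, key=scores.get)
    return node


weights = train_weights(TRAIN)
print(classify("wheat grain shipment", weights))  # prints "wheat"
```

Because each decision compares only sibling categories, a word such as "bank" can separate the two top-level branches at the root and then separate "interest" from "forex" one level down, with a different weight at each level. A flat classifier would instead compare all leaf categories at once.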
