Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies

Abstract. We explore how to organize large text databases hierarchically by topic to support searching, browsing, and filtering. Many corpora, such as internet directories, digital libraries, and patent databases, are manually organized into topic hierarchies, also called taxonomies. Like indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of online textual information makes it nearly impossible to maintain such a taxonomic organization by hand for large, fast-changing corpora. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these discriminants, we build a multilevel classifier. At each node, this classifier can ignore the large number of “noise” words in a document, so it has a small model size and is very fast. Because the features it uses are context-sensitive, that is, selected separately at each node, the classifier is also very accurate. As a by-product, we can compute for each document a signature: a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations such as sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities.
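
As a rough illustration of the per-node idea described in the abstract, the following Python sketch selects discriminating terms separately at each internal node and then classifies a document by descending the taxonomy, looking only at the selected terms at each step. The abstract does not name the scoring function or the classifier, so both are assumptions here: the score is a Fisher-style ratio of between-class to within-class variation in term rates, and the per-node model is a smoothed multinomial naive Bayes. The toy taxonomy, the documents, and all function names are hypothetical.

```python
import math
from collections import Counter

def term_rates(doc):
    """Relative frequency of each term in one tokenized document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def select_features(docs_by_child, k, eps=1e-9):
    """Keep the k terms whose mean rate differs most between children,
    relative to the spread within each child (assumed Fisher-style score)."""
    rates = {c: [term_rates(d) for d in docs] for c, docs in docs_by_child.items()}
    vocab = {t for rs in rates.values() for r in rs for t in r}
    scores = {}
    for t in vocab:
        means = {c: sum(r.get(t, 0.0) for r in rs) / len(rs) for c, rs in rates.items()}
        between = sum((means[a] - means[b]) ** 2 for a in means for b in means if a < b)
        within = sum(
            sum((r.get(t, 0.0) - means[c]) ** 2 for r in rs) / len(rs)
            for c, rs in rates.items()
        )
        scores[t] = between / (within + eps)
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])

def train_node(docs_by_child, k):
    """Fit an assumed naive Bayes model over the selected terms only."""
    features = select_features(docs_by_child, k)
    total_docs = sum(len(docs) for docs in docs_by_child.values())
    model = {}
    for child, docs in docs_by_child.items():
        counts = Counter(t for d in docs for t in d if t in features)
        denom = sum(counts.values()) + len(features)  # add-one smoothing
        log_prob = {t: math.log((counts[t] + 1) / denom) for t in features}
        model[child] = (math.log(len(docs) / total_docs), log_prob)
    return features, model

def classify(doc, taxonomy, root="root"):
    """Descend from the root, ignoring non-feature (noise) terms at each node."""
    node = root
    while node in taxonomy:
        features, model = taxonomy[node]
        kept = [t for t in doc if t in features]
        node = max(model, key=lambda c: model[c][0] + sum(model[c][1][t] for t in kept))
    return node

# Hypothetical hand-labeled sample: root -> {science, sports}, science -> {physics, biology}.
physics = [s.split() for s in ["the quantum particle energy", "the particle energy field"]]
biology = [s.split() for s in ["the cell gene protein", "the gene cell organism"]]
sports = [s.split() for s in ["the team goal match", "the match team score"]]

taxonomy = {
    "root": train_node({"science": physics + biology, "sports": sports}, k=8),
    "science": train_node({"physics": physics, "biology": biology}, k=8),
}

print(classify("the particle energy experiment".split(), taxonomy))
print(classify("the team won the match".split(), taxonomy))
```

In this toy sample the word "the" has the same rate in every class, so its score is zero and it is dropped at both nodes. A term such as "quantum" is only a weak discriminant at the root but a more useful one under the science node, which is the sense in which the selected features depend on context.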
