Paper information - Tailoring Taxonomies for Efficient Text Categorization and Expert Finding

Tailoring Taxonomies for Efficient Text Categorization and Expert Finding

Automatic content categorization by means of taxonomies is a powerful tool for information retrieval and search technologies as it improves the accessibility of data both for humans and machines. While research on automatic categorization has mainly focused on the problem of classifier design, hardly any effort has been spent on the optimization of the taxonomy size itself. However, taxonomy tailoring may significantly improve computational efficiency and scalability of modern retrieval systems where taxonomies often consist of tens of thousands of non-uniformly distributed categories. In this paper we demonstrate empirically that small subtrees of a taxonomy already enable reliable categorization. We compare several measures for the optimal selection of sub-taxonomies and investigate to what extent a reduction affects the classification quality. We consider applications in classical document categorization and in the upcoming area of expert finding and report corresponding results obtained from experiments with standard benchmark data.
