Paper information - An experimental study on large-scale web categorization

Taxonomies of the Web typically have hundreds of thousands of categories and a skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on, and scale up to, such large-scale applications. To understand this, we evaluated several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) on Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff of classifiers in a hierarchical setting compared to the conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems.
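The abstract does not name the threshold tuning strategies that were tested. As an illustration only, the following minimal sketch shows one strategy that is common in the text categorization literature: choosing, for each category separately, the score threshold that maximizes F1 on held-out validation data. The function names `f1_score` and `tune_threshold` and the toy scores are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 for one category, given binary predictions and true labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Return the threshold that maximizes F1 on validation data.

    A document is assigned to the category when its score is >= threshold.
    If no threshold gives a positive F1 (for example, a category with no
    positive validation documents), infinity is returned, so that nothing
    is assigned to the category.
    """
    best_threshold, best_f1 = float("inf"), 0.0
    # Only the observed scores need to be tried as candidate thresholds.
    for candidate in sorted(set(scores), reverse=True):
        predicted = [s >= candidate for s in scores]
        f1 = f1_score(predicted, labels)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold


# Hypothetical classifier scores and true labels for one category.
validation_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
validation_labels = [True, True, False, True, False]
threshold = tune_threshold(validation_scores, validation_labels)
```

This naive version rescans all validation documents for every candidate threshold, so its cost is quadratic in the number of documents per category. With hundreds of thousands of categories, that cost is one reason the scalability of threshold tuning matters.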