Learning Web Categorization with Controlled Generation of Context Features

Automatic categorization of Web pages is an important area of study because the amount of Web data is growing rapidly. Efficient and accurate classification would greatly facilitate finding what one needs in this sea of information. Context-sensitive techniques have proven effective for the classification task. However, the space of context features that these techniques can explore is enormous, and considering these features comprehensively often becomes prohibitive in terms of resource requirements. In this paper, we propose an approach that intelligently controls the generation of context features for the classification learning process. We present our investigation of this approach in the context of Web page categorization using the sleeping-experts technique.
