Using neighborhood information for automated categorization of Web pages

In this paper we discuss several issues related to the influence of expansion of a Web document representation on quality of topical categorization of Web pages. We consider a Web page expansion by using text content of it’s linking pages. We show that naive expansion can grab too much noise and essentially harm categorization results. We present the approach to automated pruning of linking Web pages. We report that using our approach in forming a Web page representation always leads to better results than traditional single Web page categorization.

[1]  Thorsten Joachims,et al.  A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization , 1997, ICML.

[2]  Giuseppe Attardi,et al.  Automatic Web Page Categorization by Link and Context Analysis , 1999 .

[3]  Arul Prakash Asirvatham,et al.  Web Page Classification based on Document Structure , 2001 .

[4]  C. Lee Giles,et al.  Extracting query modifications from nonlinear SVMs , 2002, WWW '02.

[5]  Ada Wai-Chee Fu,et al.  Finding Structure and Characteristics of Web Documents for Classification , 2000, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[6]  Brian D. Davison Topical locality in the Web , 2000, SIGIR '00.

[7]  Yiming Yang,et al.  A Study of Approaches to Hypertext Categorization , 2002, Journal of Intelligent Information Systems.

[8]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[9]  Federico Girosi,et al.  Support Vector Machines: Training and Applications , 1997 .

[10]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[11]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[12]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[13]  John M. Pierre,et al.  Practical Issues for Automated Categorization of Web Sites , 2000 .

[14]  Yiming Yang,et al.  Hypertext Categorization using Hyperlink Patterns and Meta Data , 2001, ICML.

[15]  Tom M. Mitchell,et al.  Using unlabeled data to improve text classification , 2001 .