Combining link-based and content-based methods for web document classification

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four measures of subject similarity derived from the Web link structure and determine how accurately they predict document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F1 over a traditional content-based classifier. Combining it with content-based methods can further improve the results, but may also introduce too much noise, since the text of Web pages is a much less reliable source of information. This work provides insight into which link-derived measures are most appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
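
To make the idea of link-derived similarity concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not name its four measures or the exact form of its Bayesian network, so everything here is an assumption: co-citation and bibliographic coupling stand in as two classic link-based measures, both are normalized with a Jaccard ratio, class scores come from a simple similarity-weighted vote, and a noisy-OR stands in for the combination with a content-based classifier. The toy graph, the labels, and the content-based belief of 0.4 are hypothetical.

```python
from collections import defaultdict


def invert(outlinks):
    """Turn a page -> cited-pages map into a page -> citing-pages map."""
    inlinks = defaultdict(set)
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].add(src)
    return dict(inlinks)


def jaccard(s, t):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def cocitation(inlinks, a, b):
    """Assumed measure: overlap between the pages that link TO a and to b."""
    return jaccard(inlinks.get(a, set()), inlinks.get(b, set()))


def coupling(outlinks, a, b):
    """Assumed measure: overlap between the pages that a and b link OUT to."""
    return jaccard(outlinks.get(a, set()), outlinks.get(b, set()))


def link_scores(sim, doc, labeled):
    """Similarity-weighted vote: sum sim(doc, other) per class, then normalize."""
    totals = defaultdict(float)
    for other, label in labeled.items():
        totals[label] += sim(doc, other)
    norm = sum(totals.values())
    return {c: (v / norm if norm else 0.0) for c, v in totals.items()}


def noisy_or(beliefs):
    """Assumed combination rule: the class is believed unless every source misses."""
    miss = 1.0
    for p in beliefs:
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical toy graph: p1..p3 are pages outside the directory.
outlinks = {
    "p1": {"a", "b"},
    "p2": {"a", "b"},
    "p3": {"b", "c"},
    "a": {"x"},
    "b": {"x", "y"},
    "c": {"z"},
}
inlinks = invert(outlinks)

# Hypothetical training labels; "b" is the document to classify.
labeled = {"a": "sports", "c": "news"}
by_links = link_scores(lambda d, o: cocitation(inlinks, d, o), "b", labeled)

# Hypothetical content-based belief that "b" belongs to "sports".
content_belief = 0.4
combined = noisy_or([by_links["sports"], content_belief])

print(by_links, combined)
```

In this sketch the external pages p1 to p3 supply all of the co-citation evidence, which mirrors the abstract's observation that links from outside the directory matter. The noisy-OR also shows how a weak content-based belief can shift a link-based score, which is one way the combination could add noise as well as signal.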
