Paper Information - Improving a Page Classifier with Anchor Extraction and Link Analysis

Improving a Page Classifier with Anchor Extraction and Link Analysis

Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential "main-category anchor wrappers"; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
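The three-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword-based `base_classify`, the `(hub, DOM path)` grouping used as a stand-in for wrapper learning, and the toy site data are all assumptions made for illustration. The final stage here simply picks the single candidate wrapper that agrees most with the base classifier, which is a much simpler substitute for the third learner the abstract mentions.

```python
from collections import defaultdict

# Hypothetical stand-in for a trained bag-of-words classifier.
POSITIVE_WORDS = {"executive", "biography", "officer"}

def base_classify(text):
    """Label a page positive if it contains any positive keyword."""
    return bool(set(text.lower().split()) & POSITIVE_WORDS)

def propose_wrappers(anchors):
    """Group link targets by (hub page, DOM path of the anchor).

    Each group is a candidate "anchor wrapper": the set of pages that
    one repeated link pattern on one hub page points to.
    """
    wrappers = defaultdict(set)
    for hub, dom_path, target in anchors:
        wrappers[(hub, dom_path)].add(target)
    return wrappers

def agreement(targets, base_labels):
    """Count pages where wrapper membership matches the base label."""
    return sum((url in targets) == label for url, label in base_labels.items())

def relabel_site(pages, anchors):
    """Relabel a site using the wrapper that best agrees with the base labels."""
    base_labels = {url: base_classify(text) for url, text in pages.items()}
    wrappers = propose_wrappers(anchors)
    if not wrappers:
        return base_labels  # no link structure to exploit
    best = max(wrappers.values(), key=lambda t: agreement(t, base_labels))
    return {url: url in best for url in pages}

# Toy site (assumed data): the base classifier misses /team/bob and
# wrongly accepts /news/2.
pages = {
    "/team/ann": "Ann is chief executive officer",
    "/team/bob": "Bob leads engineering and research",
    "/team/cy": "Cy biography and career",
    "/news/1": "Quarterly results announced",
    "/news/2": "New officer appointed to board",
}
anchors = [
    ("/team", "ul/li/a", "/team/ann"),
    ("/team", "ul/li/a", "/team/bob"),
    ("/team", "ul/li/a", "/team/cy"),
    ("/news", "div/p/a", "/news/1"),
    ("/news", "div/p/a", "/news/2"),
    ("/team", "div/footer/a", "/news/1"),
]

final_labels = relabel_site(pages, anchors)
```

On this toy site the list of links on the `/team` hub agrees with the base classifier on three of five pages, more than any other candidate, so both base-classifier mistakes are corrected by adopting the hub's link list as the category.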

William W. Cohen
