Identification of Duplicate News Stories in Web Pages

Identifying near-duplicate documents is a challenge often faced in the field of information discovery. Unfortunately, many algorithms that find near-duplicate pairs of plain-text documents perform poorly on web pages, where metadata and other extraneous information make the task much more difficult. If the content of the page (e.g., the body of a news article) can be extracted, the accuracy of duplicate detection algorithms increases greatly. Using machine learning techniques to identify the content portion of web pages, we achieve accuracy that is nearly identical to that obtained on plain text and significantly better than that of simple heuristic approaches to content extraction. We performed these experiments on a small but fully annotated corpus.
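To illustrate why extraneous page text hurts near-duplicate detection, the following minimal sketch compares two hypothetical pages that carry the same article body but different navigation and footer text. It uses Jaccard similarity over word shingles, a common near-duplicate measure chosen here only for illustration; it is not necessarily the algorithm evaluated in this work, and the sample texts, function names, and shingle size `k=3` are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc1, doc2, k=3):
    """Shingle-based similarity between two strings."""
    return jaccard(shingles(doc1, k), shingles(doc2, k))


# Hypothetical article body shared by two different sites.
article = ("The city council voted on Tuesday to approve the new "
           "transit budget after a long debate")

# Hypothetical full-page text: same article, different boilerplate.
page_a = ("Home News Sports Weather Sign in Subscribe " + article +
          " Copyright Daily Example All rights reserved")
page_b = ("Menu Local World Business Search Log in " + article +
          " Contact us Privacy policy Terms of use")

raw_score = similarity(page_a, page_b)          # boilerplate included
extracted_score = similarity(article, article)  # content only

print(f"similarity on raw page text:    {raw_score:.2f}")
print(f"similarity on extracted content: {extracted_score:.2f}")
```

In this toy setting the extracted bodies match exactly, while the raw pages score lower because shingles drawn from the navigation and footer text differ. A real system would first have to identify the content portion of each page, which is the step the machine learning approach described above addresses.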
