Hypergeometric language models for republished article finding

Republished article finding is the task of identifying articles that have been published in one source and republished more or less verbatim in another, often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model, taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To address this discrepancy, we consider distributions that arise from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on these distributions: a log odds model and a Bayesian model in which document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for republished article finding, where query length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
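The contrast between the two sampling assumptions can be illustrated with a toy computation. The sketch below is not the log odds or Bayesian model described above; it only compares an unsmoothed multinomial query likelihood (sampling with replacement) with an unsmoothed multivariate central hypergeometric likelihood (sampling without replacement). The function names and the whitespace tokenization are assumptions made for the illustration.

```python
from collections import Counter
from math import comb, factorial, prod


def multinomial_likelihood(query, doc):
    """P(query | doc) when query terms are drawn from doc WITH replacement."""
    q, d = Counter(query), Counter(doc)
    n_q, n_d = sum(q.values()), sum(d.values())
    # Multinomial coefficient: n_q! / prod(c(w, q)!)
    coef = factorial(n_q)
    for count in q.values():
        coef //= factorial(count)
    # Unsmoothed maximum-likelihood term probabilities c(w, d) / |d|
    return coef * prod((d[w] / n_d) ** count for w, count in q.items())


def hypergeometric_likelihood(query, doc):
    """P(query | doc) when query terms are drawn from doc WITHOUT replacement."""
    q, d = Counter(query), Counter(doc)
    n_q, n_d = sum(q.values()), sum(d.values())
    if n_q > n_d:
        # Cannot draw more terms than the document contains.
        return 0.0
    # comb(n, k) is 0 when k > n, so a term that occurs more often in the
    # query than in the document yields likelihood 0.
    return prod(comb(d[w], count) for w, count in q.items()) / comb(n_d, n_q)


doc = "a a b c".split()
query = list(doc)  # a verbatim republication of the document

print(multinomial_likelihood(query, doc))     # 0.1875
print(hypergeometric_likelihood(query, doc))  # 1.0
```

For a query that copies the document exactly, drawing without replacement reproduces the document with probability 1, whereas drawing with replacement spreads probability mass over many other term-count combinations. A usable retrieval model would additionally need smoothing for query terms missing from the document, which this sketch deliberately omits.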
