Paper information - Hypergeometric language models for republished article finding

Hypergeometric language models for republished article finding

Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model, taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on these distributions: a log odds model and a Bayesian model in which document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that, for the task of republished article finding, where queries approach the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
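
The contrast between sampling with and without replacement can be made concrete with a small sketch. The Python code below is an illustration only, not the retrieval models of the paper: it omits smoothing, the log odds formulation, the non-central variant, and the Dirichlet compound multinomial estimation. The function names and the toy document are assumptions introduced here. It scores a query against a document once under the multinomial distribution and once under the central multivariate hypergeometric distribution.

```python
import math
from collections import Counter


def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def multinomial_log_likelihood(query, doc):
    """log P(query | doc) when words are drawn from doc WITH replacement.

    Unsmoothed: a query word missing from doc gives probability zero.
    """
    q, d = Counter(query), Counter(doc)
    n_q, n_d = len(query), len(doc)
    log_p = math.lgamma(n_q + 1)  # log of n_q!, the multinomial numerator
    for word, count in q.items():
        if d[word] == 0:
            return float("-inf")
        log_p += count * math.log(d[word] / n_d) - math.lgamma(count + 1)
    return log_p


def hypergeometric_log_likelihood(query, doc):
    """log P(query | doc) when words are drawn from doc WITHOUT replacement.

    Central multivariate hypergeometric: prod_w C(d_w, q_w) / C(n_d, n_q).
    A query that uses a word more often than doc contains it is impossible.
    """
    q, d = Counter(query), Counter(doc)
    n_q, n_d = len(query), len(doc)
    if n_q > n_d:
        return float("-inf")
    log_p = -log_binom(n_d, n_q)
    for word, count in q.items():
        if count > d[word]:
            return float("-inf")
        log_p += log_binom(d[word], count)
    return log_p


if __name__ == "__main__":
    # Hypothetical toy document; the query is a verbatim copy of it.
    doc = ["a", "a", "b", "c"]
    query = list(doc)
    print(math.exp(multinomial_log_likelihood(query, doc)))
    print(math.exp(hypergeometric_log_likelihood(query, doc)))
```

For this toy case, where the query is a verbatim copy of the document, sampling without replacement assigns the query probability 1, whereas the multinomial model assigns it 0.1875, because drawing with replacement spreads probability mass over word sequences that the document could never produce in full. This is one way to read the abstract's argument about queries whose length approaches that of the documents.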

Maarten de Rijke | Wouter Weerkamp | Manos Tsagkias
