Quantifying the Effects of Text Duplication on Semantic Models

Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document repetition. By artificially creating different forms of duplicate text, we confirm several hypotheses about how repeated text impacts models. While a small amount of duplication is tolerable, substantial over-representation of subsets of the text may overwhelm meaningful topical patterns.
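As an illustration of what artificially creating duplicate text can look like, the following minimal sketch repeats a randomly chosen fraction of documents a fixed number of times. It is not the procedure used in this work: the function name `inject_exact_duplicates` and the parameters `fraction`, `copies`, and `seed` are hypothetical, and only exact, whole-document repetition is covered.

```python
import random
from typing import List


def inject_exact_duplicates(
    docs: List[str], fraction: float, copies: int, seed: int = 0
) -> List[str]:
    """Return a new corpus in which a random `fraction` of the documents
    appears `copies` times in total; all other documents appear once.

    Hypothetical helper for illustration only; it models exact
    whole-document duplication, not near-duplicates or partial reuse.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")

    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    n_selected = round(fraction * len(docs))
    selected = set(rng.sample(range(len(docs)), n_selected))

    corpus: List[str] = []
    for index, doc in enumerate(docs):
        # Selected documents are over-represented; the rest are kept once.
        corpus.extend([doc] * (copies if index in selected else 1))
    return corpus


if __name__ == "__main__":
    toy_docs = ["alpha beta", "gamma delta", "epsilon zeta", "eta theta"]
    duplicated = inject_exact_duplicates(toy_docs, fraction=0.5, copies=3)
    print(len(toy_docs), "->", len(duplicated), "documents")
```

Sweeping `fraction` and `copies` over a grid and retraining a model on each resulting corpus is one way to probe how much repetition a model tolerates before its learned structure shifts.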
