Detecting and modeling local text reuse

Texts propagate through many social networks and provide evidence for their structure. We describe and evaluate efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these techniques to two case studies: analyzing the culture of free reprinting in the nineteenth-century United States and the development of bills into legislation in the U.S. Congress. Using these divergent case studies, we evaluate both the efficiency of the approximate local text reuse detection methods and the accuracy of the results. These techniques allow us to explore how ideas spread, which ideas spread, and which subgroups shared ideas.
