Paper Information - User-defined Redundancy in Web Archives

User-defined Redundancy in Web Archives

Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also degrades the user experience, since users have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents, assuming a universal notion of redundancy, and thus cannot adapt to user-specific requirements such as a preference for more recent or diversely opinionated content. In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. Users specify a binary coverage relation between documents that can factor in the documents' contents as well as their metadata. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive, and we report experiences from this case study.
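To make the idea of a user-defined coverage relation and a cover set concrete, the following is a minimal single-machine sketch. It is not the paper's algorithm, which this page does not spell out, and it does not use MapReduce. The `Doc` fields, the `covers` rule (a document covers another if it is at least as recent and contains at least 80% of the other's words), and the `greedy_cover_set` helper are all assumptions made for illustration. The greedy strategy is the standard set-cover heuristic and is not guaranteed to return a minimum-cardinality set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Doc:
    doc_id: str      # assumed to be unique within the collection
    text: str
    timestamp: int   # hypothetical metadata, e.g. crawl time


def covers(a: Doc, b: Doc) -> bool:
    """Hypothetical user-defined relation: a covers b if a is at least as
    recent as b and contains at least 80% of b's distinct words."""
    words_a, words_b = set(a.text.split()), set(b.text.split())
    if not words_b:
        return True
    overlap = len(words_a & words_b) / len(words_b)
    return a.timestamp >= b.timestamp and overlap >= 0.8


def greedy_cover_set(docs: List[Doc],
                     relation: Callable[[Doc, Doc], bool]) -> List[Doc]:
    """Greedy heuristic: repeatedly pick the document that covers the most
    still-uncovered documents. Every document is treated as covering itself,
    so the loop always makes progress."""
    covered_by = {
        d.doc_id: {o.doc_id for o in docs if o is d or relation(d, o)}
        for d in docs
    }
    uncovered = {d.doc_id for d in docs}
    chosen: List[Doc] = []
    while uncovered:
        # Ties are broken in favor of the more recent document.
        best = max(
            docs,
            key=lambda d: (len(covered_by[d.doc_id] & uncovered), d.timestamp),
        )
        chosen.append(best)
        uncovered -= covered_by[best.doc_id]
    return chosen
```

Swapping in a different `covers` function (for example, one that compares source domains or stance labels) changes what counts as redundant without touching the selection step, which is the kind of user control the abstract describes.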

Klaus Berberich | Avishek Anand | Bibek Paudel
