Managing duplicates in a web archive

Crawlers harvest the web by iteratively downloading documents referenced by URLs. Different URLs frequently refer to the same document, which leads crawlers to download duplicates. As a result, web archives built through incremental crawls waste space storing these duplicates. In this paper, we study the existence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives, and we detail its implementation and evaluation. The system now supports an archive of the Portuguese web, replacing the previous NFS-based storage servers. Experimental results showed that eliminating duplicates can improve storage throughput: the web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations.
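To make the idea of eliminating duplicates at the storage level concrete, here is a minimal sketch that assumes duplicates are detected by comparing a fingerprint of each document's content. The abstract does not say which mechanism the system actually uses, so the SHA-256 fingerprint, the in-memory dictionaries, and the names `DedupStore`, `put`, and `get` are all hypothetical choices made for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy per distinct content.

    Illustrative assumption only: this is not the paper's implementation.
    """

    def __init__(self):
        self.blobs = {}  # content key (hex digest) -> document bytes
        self.urls = {}   # URL -> content key

    def put(self, url, content):
        """Store content under url; return True if the content was a duplicate."""
        key = hashlib.sha256(content).hexdigest()
        duplicate = key in self.blobs
        if not duplicate:
            # Only the first copy of a given content is written.
            self.blobs[key] = content
        # Every URL is still recorded, pointing at the shared copy.
        self.urls[url] = key
        return duplicate

    def get(self, url):
        """Return the document bytes stored for url."""
        return self.blobs[self.urls[url]]


store = DedupStore()
page = b"<html>hello</html>"
first = store.put("http://example.com/", page)
second = store.put("http://example.com/index.html", page)
```

In this sketch the second `put` stores no new bytes, because both URLs resolve to the same content key. Skipping that write is one plausible way that duplicate elimination could help write throughput, although the abstract itself does not explain where the improvement comes from.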
