Introducing the Portuguese Web Archive initiative

This paper introduces the Portuguese Web Archive initiative, presenting its main objectives and work in progress. Term search over web archive collections is a desirable feature that raises new challenges. The paper discusses how the size of the term index could be reduced without significantly decreasing the quality of search results. Results from the first crawl show that the Portuguese web comprises at least 54 million contents, corresponding to 2.8 TB of data. This crawl was stored in 2 TB of disk space using the compressed ARC format.
