Towards Web-scale Web Archaeology

Web-scale Web research is difficult. Information on the Web is vast in quantity, unorganized and uncatalogued, and available only over a network with varying reliability. Thus, Web data is difficult to collect, to store, and to manipulate efficiently. Despite these difficulties, we believe performing Web research at Web-scale is important. We have built a suite of tools that allow us to experiment on collections that are an order of magnitude or more larger than are typically cited in the literature. Two key components of our current tool suite are a fast, extensible Web crawler and a highly tuned, in-memory database of connectivity information. A Web page repository that supports easy access to and storage for billions of documents would allow us to study larger data sets and to study how the Web evolves over time.

[1]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[2]  Andrei Z. Broder,et al.  A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..

[3]  Andrei Z. Broder,et al.  Graph structure in the Web , 2000, Comput. Networks.

[4]  Albert-László Barabási,et al.  Internet: Diameter of the World-Wide Web , 1999, Nature.

[5]  Marc Najork,et al.  Breadth-First Search Crawling Yields High-Quality Pages , 2001 .

[6]  Thomas Kistler,et al.  WebL - A Programming Language for the Web , 1998, Comput. Networks.

[7]  Deborah S. Ray,et al.  The AltaVista Search Revolution , 1997 .

[8]  Marc Najork,et al.  On near-uniform URL sampling , 2000, Comput. Networks.

[9]  Marc Najork,et al.  High-performance Web Crawling High-performance Web Crawling Publication History , 2001 .

[10]  Andrei Z. Broder,et al.  A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines , 1998, Comput. Networks.

[11]  Michael J. Swain,et al.  SpeechBot: a Speech Recognition based Audio Indexing System for the Web , 2000, RIAO.

[12]  Monika Henzinger,et al.  Finding Related Pages in the World Wide Web , 1999, Comput. Networks.

[13]  Krishna Bharat,et al.  Improved algorithms for topic distillation in a hyperlinked environment , 1998, SIGIR '98.

[14]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[15]  Marc Najork,et al.  Measuring Index Quality Using Random Walks on the Web , 1999, Comput. Networks.

[16]  Krishna Bharat,et al.  The Term Vector Database: fast access to indexing terms for Web pages , 2000, Comput. Networks.

[17]  Sriram Raghavan,et al.  WebBase: a repository of Web pages , 2000, Comput. Networks.

[18]  Hector Garcia-Molina,et al.  Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[19]  Andrei Z. Broder,et al.  Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content , 1999, Comput. Networks.

[20]  Andrei Z. Broder,et al.  The Connectivity Server: Fast Access to Linkage Information on the Web , 1998, Comput. Networks.