Using high performance systems to build collections for a digital library

Nothing is more distributed than the Web, with its content spread across thousands of servers. High-performance hardware and software are essential for downloading, analyzing, and organizing this content effectively. We describe our experience using a highly parallel Web crawling system (Mercator) to automatically construct collections of scientific resources for the National Science Digital Library.
