Focused Crawls, Tunneling, and Digital Libraries

Crawling the Web to build collections of documents on pre-specified topics became an active area of research in the late 1990s, after crawler technology had been developed for search engines. Web crawling is now being seriously considered as a strategy for building large-scale digital libraries. This paper covers some of the crawl technologies that might be exploited for collection building. One of these is focused crawling, which was developed to make collection-building crawls more effective by performing a "best-first" crawl of the Web. We use powerful crawler software to implement a focused crawl, and we add tunneling to overcome some of the limitations of a pure best-first approach. Others have described tunneling as not only prioritizing links according to the relevance score of the page on which they appear, but also estimating the value of each link and prioritizing the links accordingly. We extend this work with a tunneling focused-crawl strategy that evaluates the current crawl direction on the fly to decide when to stop tunneling. Our results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
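To make the idea of a best-first crawl with tunneling concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions and not the crawler described in this paper. The names `focused_crawl`, `fetch_links`, `score`, `threshold`, and `max_tunnel` are hypothetical. The stopping rule is also an assumption: a path is abandoned once it has passed through more than `max_tunnel` consecutive off-topic pages, which stands in for whatever on-the-fly evaluation of crawl direction the paper actually uses.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],  # hypothetical: returns outlinks of a URL
    score: Callable[[str], float],            # hypothetical: topical relevance in [0, 1]
    threshold: float = 0.5,                   # assumed relevance cutoff
    max_tunnel: int = 2,                      # assumed tunneling limit
    max_pages: int = 100,
) -> List[str]:
    """Best-first crawl that tunnels through at most `max_tunnel`
    consecutive off-topic pages before giving up on a path."""
    # Heap entries: (negated parent score, insertion order, url, off-topic run length).
    frontier: List[Tuple[float, int, str, int]] = []
    seen: Set[str] = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-1.0, counter, url, 0))
            counter += 1

    collected: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        _, _, url, run = heapq.heappop(frontier)
        visited += 1
        relevance = score(url)
        if relevance >= threshold:
            collected.append(url)
            run = 0                 # back on topic: reset the tunnel counter
        else:
            run += 1
            if run > max_tunnel:
                continue            # assumed rule: stop tunneling along this path
        # Links inherit the relevance of the page they were found on.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, counter, link, run))
                counter += 1
    return collected


# Toy link graph: on-topic page E is reachable from A only through
# the off-topic pages B and D.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}
SCORES = {"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.1, "E": 0.9}

if __name__ == "__main__":
    for limit in (0, 2):
        found = focused_crawl(["A"], GRAPH.__getitem__, SCORES.__getitem__,
                              max_tunnel=limit)
        print(limit, found)
```

With `max_tunnel=0` the crawl behaves like a pure best-first crawl that never expands an off-topic page, so E is missed. With `max_tunnel=2` the crawl passes through B and D and reaches E. One simplification to note: each URL is queued only once, so a page first reached at the end of a long off-topic run keeps that run length even if a better path to it is found later.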
