Evaluating topic-driven web crawlers

Because bandwidth, storage, and computational resources are limited and the Web is dynamic, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. It is therefore essential to develop effective crawling strategies that prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, alternative crawling strategies are difficult to evaluate because the relevant sets are unknown and the search space keeps changing. We propose three methods for evaluating crawling strategies and apply them to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
