An adaptive crawler for locating hidden-Web entry points

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
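The idea of learning link patterns online can be illustrated with a minimal sketch. It is not the crawler described in this paper: the toy link graph `TOY_WEB`, the `AdaptiveLinkScorer` class, the `crawl` function, and the smoothed word-count scoring are all assumptions made for illustration, and the sketch leaves out topic focus and delayed-benefit links. It only shows how re-ranking the frontier from observed outcomes can raise the harvest rate (forms found per page visited) compared with a fixed visiting order.

```python
from collections import defaultdict

# Hypothetical toy "web": page -> (has_searchable_form, [(anchor_text, target), ...])
TOY_WEB = {
    "home": (False, [("search flights", "flights"),
                     ("about us", "about"),
                     ("search hotels", "hotels")]),
    "flights": (True, [("search cars", "cars")]),
    "about": (False, [("press news", "press")]),
    "hotels": (True, []),
    "cars": (True, []),
    "press": (False, []),
}


class AdaptiveLinkScorer:
    """Scores a link by how often its anchor words previously led to a form."""

    def __init__(self):
        self.hits = defaultdict(int)   # word -> followed links that reached a form
        self.seen = defaultdict(int)   # word -> followed links in total

    def score(self, anchor):
        words = anchor.split()
        if not words:
            return 0.5  # no evidence, e.g. the seed page
        # Laplace-smoothed success rate, averaged over the anchor words.
        return sum((self.hits[w] + 1) / (self.seen[w] + 2) for w in words) / len(words)

    def update(self, anchor, found_form):
        for w in anchor.split():
            self.seen[w] += 1
            self.hits[w] += int(found_form)


def crawl(web, seed, budget, adaptive=True):
    """Visit up to `budget` pages; return (visited pages, pages with forms)."""
    scorer = AdaptiveLinkScorer()
    frontier = [("", seed)]            # (anchor text, page) pairs not yet visited
    visited, forms = [], []
    while frontier and len(visited) < budget:
        # Re-rank the whole frontier under the current model; ties keep
        # insertion order, so an untrained scorer visits pages first-in, first-out.
        best = max(range(len(frontier)), key=lambda i: scorer.score(frontier[i][0]))
        anchor, page = frontier.pop(best)
        if page in visited:
            continue
        visited.append(page)
        has_form, links = web[page]
        if has_form:
            forms.append(page)
        if adaptive:
            scorer.update(anchor, has_form)  # learn from the link just followed
        frontier.extend(links)
    return visited, forms


for mode in (True, False):
    visited, forms = crawl(TOY_WEB, "home", budget=4, adaptive=mode)
    print("adaptive" if mode else "fixed", visited, len(forms) / len(visited))
```

On this toy graph with a budget of four pages, the adaptive run reaches three form pages (harvest rate 0.75) and the fixed-order run reaches two (0.5). The gap comes entirely from the made-up graph and says nothing about the gains measured in the experiments.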
