Searching for Hidden-Web Databases

Recently, there has been increased interest in the retrieval and integration of hidden-Web data with a view to leveraging the high-quality information available in online databases. Although previous work has addressed many aspects of the actual integration, including matching form schemata and automatically filling out forms, the problem of locating relevant data sources has been largely overlooked. Given the dynamic nature of the Web, where data sources are constantly changing, it is crucial to automatically discover these resources. However, considering the number of documents on the Web (Google already indexes over 8 billion documents), automatically finding tens, hundreds or even thousands of forms that are relevant to the integration task is like looking for a few needles in a haystack. Besides, since the vocabulary and structure of forms for a given domain are unknown until the forms are actually found, it is hard to define exactly what to look for. We propose a new crawling strategy to automatically locate hidden-Web databases, which aims to balance the two conflicting requirements of this problem: the need to perform a broad search and the need to avoid crawling a large number of irrelevant pages. The proposed strategy does so by focusing the crawl on a given topic; by judiciously choosing links to follow within a topic that are more likely to lead to pages that contain forms; and by employing appropriate stopping criteria. We describe the algorithms underlying this strategy and an experimental evaluation which shows that our approach is both effective and efficient, retrieving more forms as a function of the number of pages visited than other crawlers.
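To make the three ingredients named above concrete (prioritizing links, looking for pages that contain forms, and stopping criteria), the following is a minimal illustrative sketch of a best-first crawl over an in-memory link graph. It is not the algorithm described in the paper: the function `focused_crawl`, the scoring callback `link_score`, the predicate `has_form`, the two stopping parameters, and the toy graph are all hypothetical names and values chosen for illustration only.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def focused_crawl(
    graph: Dict[str, List[str]],          # hypothetical stand-in for the Web: page -> outgoing links
    seeds: List[str],
    link_score: Callable[[str], float],   # assumed estimate of how promising a link is
    has_form: Callable[[str], bool],      # assumed detector for pages that contain a form
    max_pages: int = 100,                 # stopping criterion 1: page budget
    max_barren_streak: int = 10,          # stopping criterion 2: consecutive pages without a form
) -> Tuple[List[str], int]:
    """Visit the most promising links first; return (pages with forms, pages visited)."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-link_score(s), s) for s in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    forms: List[str] = []
    visited = 0
    barren = 0

    while frontier and visited < max_pages and barren < max_barren_streak:
        _, page = heapq.heappop(frontier)
        visited += 1
        if has_form(page):
            forms.append(page)
            barren = 0
        else:
            barren += 1
        # Enqueue unseen outgoing links, ranked by their estimated promise.
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-link_score(target), target))

    return forms, visited


# Toy data, invented for this example.
TOY_GRAPH = {
    "home": ["jobs/index", "news/index"],
    "jobs/index": ["jobs/search", "jobs/about"],
    "news/index": ["news/a", "news/b"],
}


def toy_score(url: str) -> float:
    if "search" in url:
        return 1.0
    return 0.5 if url.startswith("jobs") else 0.0


def toy_has_form(url: str) -> bool:
    return url.endswith("search")


forms, visited = focused_crawl(TOY_GRAPH, ["home"], toy_score, toy_has_form, max_pages=4)
print(forms, visited)
```

With a budget of four pages, this toy run spends all of its visits on the higher-scoring `jobs/` branch and never touches the `news/` pages, which is the kind of trade-off between breadth and wasted visits that the abstract describes. A stricter `max_barren_streak` stops earlier, at the risk of giving up before a form is reached.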
