A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases

The Web has been rapidly “deepened” by massive online databases: recent surveys show that while the surface Web links billions of static HTML pages, a far more significant amount of information is “hidden” in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, the deep Web is an important frontier for information search. In this paper, we develop a novel Web Form Crawler that collects the “doors” of Web databases, i.e., query forms, to build a database of online databases both efficiently and comprehensively. Being object-focused, topic-neutral, and coverage-comprehensive, such a crawler, while critical to searching and integrating online databases, has not been extensively studied. In particular, query forms, though numerous, are sparsely scattered among pages relative to the size of the Web, which brings new challenges for focused crawling. First, because our crawling problem is topic-neutral, we cannot rely on existing topic-focused crawling techniques. Second, existing focused crawling cannot meet the comprehensiveness requirement because it is not aware of the coverage of the crawled content. As a new attempt, we propose a structure-driven crawling framework based on the observed structure locality of query forms: query forms are often close to the root pages of Web sites and accessible by following navigational links. Exploiting this structure locality, we instantiate the structure-driven crawling framework as a site-based Web Form Crawler that first collects site entrances, as the Site Finder, and then searches for query forms within the scope of each site, as the Form Finder.
Analytical justification and empirical evaluation of the Web Form Crawler both show that: 1) our crawler can maintain stable harvest and coverage throughout the crawling, and 2) compared to page-based crawling, our best harvest rate is about 10 to 400 times better, depending on the page traversal schemes used.
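
The site-scoped search described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `find_forms`, the `max_depth` cutoff, the caller-supplied `fetch` function, and the toy `SITE` pages are all assumptions made for illustration. The sketch does a breadth-first traversal from a site's root page, stays on the same host, and stops at a fixed depth, reflecting the observation that query forms tend to lie close to the root. It flags any page containing a `<form>` tag; deciding whether a form is really a query form would need a separate classifier that is not shown here. Harvest is computed as pages with forms divided by pages visited, which is one common reading of harvest rate and is assumed here as well.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PageScanner(HTMLParser):
    """Collects outgoing links and counts <form> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.form_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "form":
            self.form_count += 1


def find_forms(root_url, fetch, max_depth=3):
    """Breadth-first search from a site's root page, staying on one host.

    `fetch` is a hypothetical caller-supplied function that maps a URL to
    its HTML text, or None if the page is unavailable.
    Returns (pages_with_forms, pages_visited).
    """
    host = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([(root_url, 0)])
    pages_with_forms = []
    visited = 0
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited += 1
        scanner = _PageScanner()
        scanner.feed(html)
        if scanner.form_count:
            pages_with_forms.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth cutoff
        for href in scanner.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages_with_forms, visited


# Toy in-memory site (made-up pages) standing in for real HTTP fetches.
SITE = {
    "http://example.org/": '<a href="/search">Search</a>'
                           '<a href="/about">About</a>'
                           '<a href="http://other.org/">Elsewhere</a>',
    "http://example.org/search": '<form action="/q"><input name="q"></form>',
    "http://example.org/about": '<a href="/deep">Deep</a>',
    "http://example.org/deep": "<form></form>",
}

forms, visited = find_forms("http://example.org/", SITE.get, max_depth=1)
harvest = len(forms) / visited  # pages with forms per page visited
```

With `max_depth=1`, the traversal visits the root and its two same-site children and never reaches the deeper form or the off-site link. Raising `max_depth` trades more page visits for potentially higher coverage.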
