A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases

A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., searchable forms, on the Web. The task is challenging because the forms of domain-specific WDBs are dynamic and heterogeneous, and they are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more intelligent and effective solutions remain to be explored. This paper proposes a new architecture of an intelligent agent-based crawler (iCrawler) for domain-specific Deep Web databases to address the limitations of existing methods. The iCrawler combines intelligent learning agents and a domain ontology with a series of novel strategies, including a two-step page classifier and a link scoring strategy, to improve on the performance of existing methods. Experiments with the iCrawler on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web Form-Focused Crawlers (FFCs) in terms of harvest rate, coverage rate, and time performance.
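The abstract names harvest rate and coverage rate as evaluation metrics but does not define them. As a minimal sketch, assuming the definitions commonly used for form-focused crawlers (relevant forms found per page fetched, and relevant forms found as a fraction of a known reference set), the two metrics could be computed as follows. The `CrawlStats` type and its field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CrawlStats:
    """Hypothetical counters collected during one crawl run."""
    pages_fetched: int          # total pages downloaded by the crawler
    relevant_forms_found: int   # domain-specific searchable forms discovered
    relevant_forms_known: int   # size of a reference set of relevant forms


def harvest_rate(stats: CrawlStats) -> float:
    """Assumed definition: relevant forms found per page fetched."""
    if stats.pages_fetched == 0:
        return 0.0
    return stats.relevant_forms_found / stats.pages_fetched


def coverage_rate(stats: CrawlStats) -> float:
    """Assumed definition: fraction of known relevant forms that were found."""
    if stats.relevant_forms_known == 0:
        return 0.0
    return stats.relevant_forms_found / stats.relevant_forms_known


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    stats = CrawlStats(pages_fetched=1000,
                       relevant_forms_found=50,
                       relevant_forms_known=200)
    print(f"harvest rate:  {harvest_rate(stats):.3f}")
    print(f"coverage rate: {coverage_rate(stats):.3f}")
```

Under these assumed definitions, a crawler with a higher harvest rate wastes fewer downloads on irrelevant pages, while a higher coverage rate means fewer relevant forms are missed. The paper's own definitions may differ.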
