E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. The task is challenging because the forms of domain-specific WDBs are dynamic and heterogeneous and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDB forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, to optimize the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments with the E-FFC on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate, and crawling robustness.
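The two evaluation metrics named above can be illustrated with a minimal sketch. The abstract does not define them, so the formulas below are assumptions based on how the terms are commonly used in the focused-crawling literature: harvest rate as relevant forms found per page crawled, and coverage rate as the fraction of all known relevant forms that were found. The function and parameter names are hypothetical, and the paper's own definitions may differ.

```python
def harvest_rate(relevant_forms_found: int, pages_crawled: int) -> float:
    """Assumed definition: relevant forms found per page crawled."""
    if pages_crawled == 0:
        return 0.0
    return relevant_forms_found / pages_crawled


def coverage_rate(relevant_forms_found: int, total_relevant_forms: int) -> float:
    """Assumed definition: share of all known relevant forms that were found."""
    if total_relevant_forms == 0:
        return 0.0
    return relevant_forms_found / total_relevant_forms


# Illustrative numbers only, not results from the paper.
print(harvest_rate(50, 1000))   # forms found per crawled page
print(coverage_rate(50, 200))   # fraction of known forms discovered
```

Under these assumed definitions the two metrics pull in opposite directions: crawling more pages tends to raise coverage while lowering the harvest rate, which is why achieving both simultaneously is treated as the goal.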
