Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we carry out four tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the scale. Second, we present a Web site designed to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of the crawlers on that Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to several open-source and commercial crawlers, as well as to the robots of the main Web search engines.
