Efficient web harvesting strategies for monitoring deep web content

Web content changes rapidly [18]. In Focused Web Harvesting [17], whose aim is to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to all the web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigating a topic, you may need access not only to all the relevant information but also to recent changes and updates. General search engines like Google apply several techniques to enhance the freshness of their crawled data. However, focused web harvesting lacks an efficient approach for detecting changes for a given topic over time. In this paper, we focus on techniques that keep the content relevant to a given query up to date. To do so, we test four different approaches to efficiently harvest all the changed documents matching a given entity by querying web search engines. We define a changed document as a document whose content has changed, or one that has been newly created or removed. Among the proposed change detection approaches, the FedWeb method outperforms the others in finding changed content on the web for a given query, performing 20 percent better on average.
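The definition of a changed document given above can be illustrated with a minimal sketch. It is not one of the four approaches tested in the paper. It assumes two harvests of the same query are available as mappings from URL to page text (the names `changed_documents`, `old_harvest`, and `new_harvest` are hypothetical), and it treats any difference in a SHA-256 hash of the text as a content change, which is a simplification that ignores near-duplicate detection.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the page text (assumed change signal)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(old_harvest: dict, new_harvest: dict) -> dict:
    """Compare two harvests (URL -> page text) for the same query.

    A document counts as changed if it was newly created, removed,
    or if its content differs between the two harvests.
    """
    old_fp = {url: fingerprint(body) for url, body in old_harvest.items()}
    new_fp = {url: fingerprint(body) for url, body in new_harvest.items()}

    created = set(new_fp) - set(old_fp)   # only in the new harvest
    removed = set(old_fp) - set(new_fp)   # only in the old harvest
    modified = {
        url
        for url in set(old_fp) & set(new_fp)
        if old_fp[url] != new_fp[url]     # present in both, content differs
    }
    return {"created": created, "removed": removed, "modified": modified}


if __name__ == "__main__":
    # Hypothetical example data, not taken from the paper.
    old_harvest = {"u1": "same text", "u2": "old text", "u3": "gone soon"}
    new_harvest = {"u1": "same text", "u2": "new text", "u4": "fresh page"}
    print(changed_documents(old_harvest, new_harvest))
```

Exact hashing flags even trivial edits such as a changed timestamp; a real harvester would likely need a more tolerant comparison.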

[1]  Djoerd Hiemstra, et al. Size estimation of non-cooperative data collections, 2012, IIWAS '12.

[2]  Toshihide Ibaraki, et al. Resource allocation problems - algorithmic approaches, 1988, MIT Press series in the foundations of computing.

[3]  George Cybenko, et al. How dynamic is the Web?, 2000, Comput. Networks.

[4]  Claudio Carpineto, et al. A Survey of Automatic Query Expansion in Information Retrieval, 2012, CSUR.

[5]  Hector Garcia-Molina, et al. Estimating frequency of change, 2003, TOIT.

[6]  David R. Karger, et al. Tackling the Poor Assumptions of Naive Bayes Text Classifiers, 2003, ICML.

[7]  Djoerd Hiemstra, et al. Deep web entity monitoring, 2013, WWW '13 Companion.

[8]  Jeffrey Scott Vitter, et al. Characterizing Web Document Change, 2001, WAIM.

[9]  Djoerd Hiemstra, et al. Towards complete coverage in focused web harvesting, 2015, iiWAS.

[10]  Philip S. Yu, et al. Optimal crawling strategies for web search engines, 2002, WWW '02.

[11]  Djoerd Hiemstra, et al. FedWeb Greatest Hits: Presenting the New Test Collection for Federated Web Search, 2015, WWW.

[12]  Sang-goo Lee, et al. Proceedings of the 3rd international workshop on Data enginering issues in E-commerce and services: In conjunction with ACM Conference on Electronic Commerce (EC '07), 2007.

[13]  Swati Mali. Focused Web Crawler with Page Change Detection Policy, 2011.

[14]  Zhen Liu, et al. Optimal Robot Scheduling for Web Search Engines, 1998.

[15]  Michael J. Cafarella. Extracting and Querying a Comprehensive Web Database, 2009, CIDR.

[16]  Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[17]  Djoerd Hiemstra, et al. Overview of the TREC 2014 Federated Web Search Track, 2013, TREC.

[18]  Hector Garcia-Molina, et al. Synchronizing a database to improve freshness, 2000, SIGMOD '00.

[19]  Naresh Kumar, et al. A Survey on Reduction of Load on the Network, 2014, ISI.

[20]  Anja Feldmann, et al. Rate of Change and other Metrics: a Live Study of the World Wide Web, 1997, USENIX Symposium on Internet Technologies and Systems.

[21]  Djoerd Hiemstra, et al. Harvesting All Matching Information To A Given Query From a Deep Website, 2015, KDWeb.

[22]  Craig E. Wills, et al. Towards a Better Understanding of Web Resources and Server Responses for Improved Caching, 1999, Comput. Networks.

[23]  George Cybenko, et al. Keeping up with the changing Web, 2000, Computer.

[24]  V. Kamakshi Prasad, et al. WEB CONTENT MINING TOOLS: A COMPARATIVE STUDY, 2011.

[25]  Hector Garcia-Molina, et al. The Evolution of the Web and Implications for an Incremental Crawler, 2000, VLDB.

[26]  Andrei Z. Broder, et al. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines, 1998, Comput. Networks.

[27]  Carlos Castillo, et al. Effective web crawling, 2005, SIGF.

[28]  Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value, 2001.

[29]  C. Lee Giles, et al. Accessibility of information on the web, 1999, Nature.

[30]  Y. Syed Mudhasir. Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search, 2012.

[31]  Zvi Galil, et al. A Fast Selection Algorithm and the Problem of Optimum Distribution of Effort, 1979, JACM.

[32]  Victor Carneiro, et al. DeepBot: a focused crawler for accessing hidden web content, 2007, DEECS '07.

[33]  Jayant Madhavan, et al. Google's Deep Web crawl, 2008, Proc. VLDB Endow.

[34]  Kevin Chen-Chuan Chang, et al. Editorial: special issue on web content mining, 2004, SKDD.

[35]  Ricardo A. Baeza-Yates, et al. Web Dynamics, Structure, and Page Quality, 2004, Web Dynamics.

[36]  Serge Abiteboul. Issues in Monitoring Web Data, 2002, DEXA.

[37]  Robert Boncella, et al. Competitive Intelligence and the Web, 2003, Commun. Assoc. Inf. Syst.

[38]  Yeye He, et al. Crawling deep web entity pages, 2013, WSDM.