Towards complete coverage in focused web harvesting

With the goal of harvesting all information about a given entity, this paper addresses the problem of retrieving all documents matching a query submitted to a search engine. The objective is to retrieve everything available about, for instance, "Michael Jackson", "Islamic State", or "FC Barcelona", whether from data indexed by search engines or from hidden data behind web forms, using a minimum number of queries. Web search engine policies usually prevent access to the complete set of results matching a query: they limit both the number of returned documents and the number of user requests. The same limitations apply to deep web sources, for instance social networks such as Twitter. We propose a new approach that automatically collects information related to a given query from a search engine while respecting these limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this information with evidence from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured as the total number of unique documents found per query.
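The sketch below illustrates the general idea of such a query-budgeted harvesting loop: issue a query, record the unique documents returned, and use the retrieved text together with term weights from an external corpus to pick the next query. It is a minimal illustration only; the `search(query, limit)` function is a hypothetical stand-in for any rate-limited search API, and the greedy term weighting is an assumption, not the paper's actual selection strategy.

```python
# Minimal sketch of a greedy, query-budgeted harvesting loop.
# `search(query, limit)` is a hypothetical stand-in for a rate-limited
# search engine API; the term weighting below is illustrative only.
from collections import Counter

def harvest(entity, search, external_corpus_terms,
            max_queries=50, results_per_query=100):
    """Greedily issue queries expected to return previously unseen documents."""
    harvested = {}                               # url -> text of unique documents found
    issued = set()                               # queries already sent (query budget)
    candidates = Counter(external_corpus_terms)  # prior term weights from external corpus

    query = entity
    for _ in range(max_queries):
        issued.add(query)
        for url, text in search(query, limit=results_per_query):
            if url not in harvested:
                harvested[url] = text
                # Boost terms seen in newly retrieved documents, so the next
                # query targets vocabulary that co-occurs with the entity.
                candidates.update(w.lower() for w in text.split() if w.isalpha())

        # Next query: the entity plus the highest-weighted unused term.
        next_term = next((t for t, _ in candidates.most_common()
                          if f"{entity} {t}" not in issued and t not in entity.lower()),
                         None)
        if next_term is None:
            break
        query = f"{entity} {next_term}"

    return harvested
```

In practice the stopping criterion would also account for the per-request result limit and the diminishing number of new documents per query, which is the quantity the evaluation above measures.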
