Harvesting All Matching Information To A Given Query From a Deep Website

This paper addresses the problem of harvesting all documents that match a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, "Denzel Washington", "Iran Nuclear Deal", or "FC Barcelona" from data hidden behind web forms. Web search engines usually do not allow access to all results matching a query: they limit both the number of documents returned and the number of requests a user may send. In this work, we propose a new approach that automatically collects information related to a given query from a search engine, within the search engine's limitations. The approach minimizes the number of queries that need to be sent by using information from a large external corpus. When tested on Google, the new approach outperforms existing approaches, as measured by the total number of unique documents found per query.
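The general idea of harvesting under a result cap can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the abstract does not say how terms are chosen from the external corpus, so the ranking by document frequency below is an assumption, and the names `make_toy_engine`, `rank_terms`, and `harvest` are hypothetical. The sketch appends corpus terms to the seed query so that each reformulated query can surface documents hidden behind the per-query limit, and it stops once a request budget is spent.

```python
from collections import Counter


def make_toy_engine(docs, limit):
    """Hypothetical stand-in for a search engine that caps results per query."""
    def search(query):
        words = query.lower().split()
        hits = [doc_id for doc_id, text in docs.items()
                if all(w in text.lower().split() for w in words)]
        return hits[:limit]  # only the first `limit` matches are returned
    return search


def rank_terms(external_corpus):
    """Order terms by document frequency in an external corpus.

    Assumption: most frequent first, ties broken alphabetically.
    The paper's actual selection strategy is not given in the abstract.
    """
    df = Counter()
    for text in external_corpus:
        df.update(set(text.lower().split()))
    return sorted(df, key=lambda term: (-df[term], term))


def harvest(seed_query, search, terms, max_requests):
    """Collect unique documents for seed_query within a request budget."""
    seen = set(search(seed_query))
    requests = 1
    for term in terms:
        if requests >= max_requests:
            break
        # Reformulate: the added term narrows the result set, exposing
        # matches that the capped seed query could not return.
        seen.update(search(f"{seed_query} {term}"))
        requests += 1
    return seen, requests


docs = {
    1: "barcelona wins match",
    2: "barcelona signs player",
    3: "barcelona stadium tour",
    4: "barcelona match report",
    5: "madrid wins match",
}
external_corpus = [
    "match report today",
    "stadium tour guide",
    "player signs contract",
    "match wins",
]

search = make_toy_engine(docs, limit=2)
found, used = harvest("barcelona", search, rank_terms(external_corpus),
                      max_requests=10)
```

In this toy setting the seed query alone returns only two of the four matching documents, while the reformulated queries recover the rest. Several of those queries return nothing new, which is the waste a better term-selection strategy would aim to reduce.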
