Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Search engines over document corpora typically provide keyword-search interfaces. Examples include web search engines as well as those over enterprise and government websites. The corpus of such a search engine is a rich source of information of analytical interest to third parties, but the only available access is by issuing search queries through its interface. Supporting data analytics over a search engine's corpus requires addressing two main problems: sampling documents (for offline analytics) and directly estimating aggregates (online), in both cases while issuing only a small number of queries through the keyword-search interface. Existing work on sampling produces samples with unknown bias and may incur an extremely high query cost. Existing aggregate estimation techniques suffer from a similar problem, as both the estimation error and the query cost can be large for certain aggregates. We propose novel techniques that produce unbiased samples as well as unbiased aggregate estimates with small variance, while incurring a query cost an order of magnitude smaller than that of existing techniques. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.
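To make the notion of an unbiased aggregate estimate concrete, the following is a minimal sketch and not the technique proposed in this paper. It assumes a toy in-memory corpus, a hypothetical `search` function that returns every matching document (no top-k limit), and a small fixed query pool. It applies the standard Hansen-Hurwitz estimator: each sampled document is weighted by the inverse of its selection probability, so the average of the weighted values is an unbiased estimate of the aggregate over all documents reachable from the pool. All names (`build_index`, `search`, `selection_probability`, `estimate_sum`) are illustrative.

```python
import random
from collections import defaultdict


def build_index(corpus):
    """Build a toy inverted index: word -> list of document ids."""
    index = defaultdict(list)
    for doc_id, words in corpus.items():
        for word in set(words):
            index[word].append(doc_id)
    return index


def search(index, word):
    """Hypothetical keyword-search interface; returns ALL matches (assumption)."""
    return index.get(word, [])


def selection_probability(doc_words, pool, index):
    """Probability that the two-step sampler below picks this document.

    p(d) = sum over pool queries q matching d of (1/|pool|) * (1/|results(q)|).
    Against a real engine, each |results(q)| would cost one query.
    """
    return sum(
        1.0 / (len(pool) * len(search(index, q)))
        for q in pool
        if q in doc_words
    )


def estimate_sum(corpus, index, pool, f, n_samples, rng):
    """Hansen-Hurwitz estimate of sum(f(d)) over documents matched by the pool."""
    total = 0.0
    for _ in range(n_samples):
        query = rng.choice(pool)           # step 1: uniform random query
        results = search(index, query)
        if not results:
            continue                       # empty result contributes 0
        doc_id = rng.choice(results)       # step 2: uniform random result
        p = selection_probability(set(corpus[doc_id]), pool, index)
        total += f(doc_id) / p             # inverse-probability weighting
    return total / n_samples


# Toy data (made up for illustration only).
corpus = {0: ["a", "b"], 1: ["a"], 2: ["b", "c"], 3: ["c"], 4: ["a", "c"]}
pool = ["a", "b", "c"]
index = build_index(corpus)

# Estimate the corpus size by using f(d) = 1 for every document.
size_estimate = estimate_sum(
    corpus, index, pool, lambda d: 1.0, 20000, random.Random(0)
)
print(size_estimate)
```

The sketch sidesteps the issues that make the real problem hard: a real interface truncates results to the top-k documents, and computing each selection probability can itself consume many queries, which is where query cost and variance become the central concerns.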
