Mining a search engine's corpus without a query pool

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in big-picture analytical information about such a corpus, but can access it only through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner, specifically by issuing a small number of search queries through the web interface. Almost all existing techniques require a pre-constructed query pool, i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge of the corpus (e.g., its size, topics, and special terms), which leads to a chicken-and-egg problem. We develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and for aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch for the corpus.
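To make the notion of a query pool concrete, the following is a minimal toy sketch, not the paper's QG-SAMPLER or QG-ESTIMATOR. It assumes a hypothetical keyword interface `search` that returns at most `k` matching documents, and measures what fraction of a small made-up corpus a given pool recalls. The corpus, the pools, and all function names are illustrative assumptions.

```python
from typing import Dict, List, Set


def search(corpus: Dict[int, str], query: str, k: int) -> List[int]:
    """Hypothetical top-k keyword interface: ids of at most k documents
    that contain the query word, in ascending id order."""
    matches = [
        doc_id
        for doc_id, text in sorted(corpus.items())
        if query in text.lower().split()
    ]
    return matches[:k]


def pool_recall(corpus: Dict[int, str], pool: List[str], k: int) -> float:
    """Fraction of the corpus returned by at least one query in the pool."""
    recalled: Set[int] = set()
    for query in pool:
        recalled.update(search(corpus, query, k))
    return len(recalled) / len(corpus)


# Made-up corpus for illustration only.
corpus = {
    1: "flu vaccine schedule",
    2: "flu symptoms in children",
    3: "election results tonight",
    4: "stock market rally",
}

matched_pool = ["flu", "election", "stock"]   # covers every topic in the corpus
mismatched_pool = ["flu", "vaccine"]          # built for a health-only corpus

print(pool_recall(corpus, matched_pool, k=2))
print(pool_recall(corpus, mismatched_pool, k=2))
```

In this toy setting the matched pool recalls every document while the mismatched pool misses half of them, which is the kind of mismatch the abstract refers to. Choosing a good pool here required knowing the corpus topics in advance.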
