Paper information - Mining a search engine's corpus without a query pool

Mining a search engine's corpus without a query pool

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no direct way of accessing it other than through the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner - specifically, by issuing a small number of search queries through the web interface. Almost all existing techniques require a pre-constructed query pool - i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge (e.g., size, topic, special terms used) of the corpus, essentially leading to a chicken-and-egg problem. We develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform them when the query pool is a mismatch.

Gautam Das | Nan Zhang | Mingyang Zhang
