Sampling Search-Engine Results

Abstract

We consider the problem of efficiently sampling Web search engine query results. Using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:

- Determining the set of categories in a given taxonomy spanned by the search results;
- Finding the range of metadata values associated with the result set in order to enable "multi-faceted search";
- Estimating the size of the result set;
- Data mining associations to the query terms.

We present and analyze efficient algorithms for obtaining uniform random samples, applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example Google, Yahoo Search, MSN Search, and Ask, belong to this class.) Furthermore, our algorithms can be modified to follow the modern object-oriented approach, whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case, we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from the primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
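To make the sample-next(p) idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the construction from the paper: posting lists are modeled as sorted in-memory lists of document ids, and the names `sample_postings`, `sample_and`, `contains`, and `estimate_size` are hypothetical. The primitive sampler keeps each posting independently with probability p by drawing geometric skip lengths rather than flipping a coin per posting. The AND sampler uses one simple strategy, which may differ from the paper's: sample the shorter list and keep only the documents that also appear in the other list.

```python
import math
import random
from bisect import bisect_left


def sample_postings(doc_ids, p, rng):
    """Yield each id of a sorted posting list independently with probability p.

    Instead of one coin flip per posting, draw the number of postings to
    skip from a geometric distribution and jump directly to the next kept one.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    pos = -1  # cursor sits before the first posting
    while True:
        if p < 1.0:
            u = 1.0 - rng.random()  # uniform in (0, 1]
            # P(skip = k) = (1 - p)^k * p
            skip = int(math.log(u) / math.log1p(-p))
        else:
            skip = 0
        pos += skip + 1
        if pos >= len(doc_ids):
            return
        yield doc_ids[pos]


def contains(sorted_ids, doc_id):
    """Binary-search membership test, standing in for a skip-to operation."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id


def sample_and(a, b, p, rng):
    """Sample the intersection of two sorted posting lists with probability p.

    Assumed strategy (not necessarily the paper's): sample the shorter list,
    then keep a sampled document only if the other list also contains it.
    Every document in the intersection is kept independently with probability p.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return [d for d in sample_postings(short, p, rng) if contains(long_, d)]


def estimate_size(sample_size, p):
    """Unbiased estimate of the result-set size from a Bernoulli(p) sample."""
    return sample_size / p
```

For example, calling `sample_and(a, b, 0.01, random.Random(42))` on two large posting lists returns roughly 1% of their intersection, and `estimate_size(len(sample), 0.01)` gives an estimate of the full result-set size, one of the applications listed in the abstract.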
