Entropy of search logs: how hard is search? with personalization? with backoff?

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million URLs could capture much of the value. We apply language modeling techniques to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions. Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help, and how large is the potential opportunity? All these difficult questions can be answered by estimating entropy from search logs. In this paper, we propose a new way to personalize search: personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, we back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics, and surrogate variables such as time and geography.
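To make the two quantities above concrete, the following is a minimal sketch in Python, not the procedure used in the paper. It assumes a log has already been reduced to click counts per URL, and that each held-out click has been scored by a fixed set of backoff-class models (for example, full IP address, first three bytes, first two bytes, all users). The function names `entropy_bits`, `perplexity`, and `em_mixture_weights` are hypothetical. The entropy shown is a plug-in estimate, which tends to underestimate the true entropy on sparse data; the EM routine is the standard procedure for fitting linear interpolation weights and stands in for whatever variant the authors used.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Plug-in entropy estimate H = -sum p(x) log2 p(x) from raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def perplexity(counts):
    """Perplexity 2**H: the size of a uniform distribution with the same entropy."""
    return 2 ** entropy_bits(counts)


def em_mixture_weights(component_probs, iterations=100):
    """Fit interpolation weights for a fixed set of backoff models with EM.

    component_probs: one tuple per held-out event, giving the probability
    that each backoff class (an assumption here: ordered from most to least
    specific) assigned to the event that actually occurred.
    """
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            if mix == 0.0:
                continue  # no class explains this event; skip it
            # E-step: posterior responsibility of each class for this event
            for i in range(k):
                expected[i] += lambdas[i] * probs[i] / mix
        total = sum(expected)
        if total == 0.0:
            break  # nothing to re-estimate from
        # M-step: new weights are the normalized responsibilities
        lambdas = [e / total for e in expected]
    return lambdas


if __name__ == "__main__":
    # Toy click counts (invented for illustration only).
    clicks = Counter({"a.example": 5, "b.example": 5,
                      "c.example": 5, "d.example": 5})
    print(entropy_bits(clicks), perplexity(clicks))

    # Toy held-out events scored by two classes: (user-specific, global).
    heldout = [(0.6, 0.1), (0.0, 0.2), (0.5, 0.1), (0.0, 0.3)]
    print(em_mixture_weights(heldout))
```

Read this way, a perplexity in the millions means that predicting the clicked URL is about as hard as choosing uniformly among a few million candidates, and the fitted weights indicate how much each backoff class contributes on held-out data.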
