Analysis of long queries in a large scale search log

We propose using the search log to study long queries, both to understand the types of information needs behind them and to design techniques that improve search effectiveness when they are used. Long queries arise in many applications, such as community-based question answering (CQA) and literature search, and they have been studied to some extent using TREC data. They are also quite common in web search, as the distribution of query lengths in a large-scale search log shows. In this paper we analyze the long queries in the search log to identify the characteristics of the most commonly occurring query types and the issues involved in using them effectively in a search engine. In addition, we propose a simple yet effective method for evaluating the performance of queries in the search log by combining the log's click data with existing TREC corpora.
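As a rough illustration of what looking at the distribution of query lengths might involve, the following minimal sketch counts whitespace-separated words per query and reports the fraction of queries at each length. The function names, the in-memory list of query strings, and the five-word threshold for calling a query "long" are all assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter


def query_length_distribution(queries):
    """Map each query length (in whitespace-separated words) to its share of queries.

    Blank queries are skipped. Whitespace tokenization is an assumption; a real
    log may need language-specific tokenization.
    """
    lengths = Counter(len(q.split()) for q in queries if q.strip())
    total = sum(lengths.values())
    return {n: count / total for n, count in sorted(lengths.items())}


def long_query_share(distribution, min_words=5):
    """Fraction of queries with at least `min_words` words.

    The default threshold of 5 is a hypothetical cut-off chosen for illustration.
    """
    return sum(share for n, share in distribution.items() if n >= min_words)


if __name__ == "__main__":
    # Hypothetical sample standing in for one query per log record.
    sample_log = [
        "weather",
        "cheap flights",
        "how do i renew a passport by mail",
        "python list sort",
        "what causes the northern lights to appear green",
    ]
    dist = query_length_distribution(sample_log)
    for n_words, share in dist.items():
        print(f"{n_words} words: {share:.2f}")
    print(f"share of long queries: {long_query_share(dist):.2f}")
```

On a real log the same counts would normally be computed in a streaming fashion rather than from a list held in memory, and the choice of threshold would depend on how "long" is defined for the analysis.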
