Towards optimum query segmentation: in doubt without

Query segmentation is the problem of identifying those keywords in a query that together form compound concepts or phrases like "new york times". Such segments can help a search engine to better interpret a user's intent and to tailor the search results more appropriately. Our contributions to this problem are threefold. (1) We conduct the first large-scale study of human segmentation behavior based on more than 500,000 segmentations. (2) We show that the traditionally applied segmentation accuracy measures are not appropriate for such large-scale corpora and introduce new, more robust measures. (3) We develop a new query segmentation approach with the basic idea that, in cases of doubt, it is often better to (partially) leave queries without any segmentation. This new in-doubt-without approach chooses different segmentation strategies depending on query types. A large-scale evaluation shows substantial improvement upon the state of the art in terms of segmentation accuracy. To draw a complete picture, we also evaluate the impact of segmentation strategies on retrieval performance in a TREC setting. It turns out that more accurate segmentation does not necessarily yield better retrieval performance. Based on this insight, we propose an in-doubt-without variant that achieves the best retrieval performance despite leaving many queries unsegmented. But there is still room for improvement: the optimum segmentation strategy, which always chooses the segmentation that maximizes retrieval performance, significantly outperforms all other tested approaches.
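The basic idea of leaving a query unsegmented in cases of doubt can be illustrated with a minimal sketch. This is not the approach developed in the paper, whose details the abstract does not give. The scoring function (segment length times phrase count), the `phrase_counts` table, and the `min_score` threshold are all hypothetical choices made for illustration only.

```python
from itertools import combinations


def all_segmentations(words):
    """Yield every split of the word list into contiguous segments."""
    n = len(words)
    for k in range(n):
        for breaks in combinations(range(1, n), k):
            bounds = [0, *breaks, n]
            yield [tuple(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def score(segmentation, phrase_counts):
    """Hypothetical score: sum of length * count over multi-word segments."""
    return sum(
        len(seg) * phrase_counts.get(seg, 0)
        for seg in segmentation
        if len(seg) > 1
    )


def segment_in_doubt_without(query, phrase_counts, min_score):
    """Return the best-scoring segmentation, or no segmentation if in doubt.

    "In doubt" is modeled here (as an assumption) by the best score
    falling below min_score.
    """
    words = query.split()
    if not words:
        return []
    unsegmented = [(w,) for w in words]
    best = max(all_segmentations(words), key=lambda s: score(s, phrase_counts))
    if score(best, phrase_counts) < min_score:
        return unsegmented
    return best


# Made-up phrase counts, for illustration only.
counts = {
    ("new", "york"): 100,
    ("new", "york", "times"): 80,
    ("york", "times"): 5,
}
confident = segment_in_doubt_without("new york times subscription", counts, 50)
doubtful = segment_in_doubt_without("cheap flights berlin", counts, 50)
```

With these made-up counts, the first query keeps "new york times" together as one segment, while the second query has no supporting evidence and is left as single keywords. Enumerating all segmentations is exponential in the number of keywords, which is usually acceptable only because web queries are short.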
