Discovering and understanding word level user intent in Web search queries

Identifying and interpreting user intent are fundamental to semantic search. In this paper, we investigate the association of intent with individual words of a search query. We propose that words in queries can be classified as either content or intent, where content words represent the central topic of the query, while users add intent words to make their requirements more explicit. We argue that intelligent processing of intent words can be vital to improving result quality, and in this work we focus on intent word discovery and understanding. Our approach to intent word detection is motivated by the hypothesis that query intent words satisfy certain distributional properties in large query logs, similar to those of function words in natural language corpora. Following this idea, we first demonstrate the effectiveness of our corpus distributional features, namely word co-occurrence counts and entropies, for function word detection in five natural languages. Next, we show that reliable detection of intent words in queries is possible using these same features computed from query logs. To make the distinction between content and intent words more tangible, we additionally provide operational definitions of content and intent words as those words that should match, and those that need not match, respectively, in the text of relevant documents. In addition to a standard evaluation against human annotations, we also provide an alternative validation of our ideas using clickthrough data. Concordance of the two orthogonal evaluation approaches provides further support for our original hypothesis that two distinct word classes exist in search queries. Finally, we provide a taxonomy of intent words derived through rigorous manual analysis of large query logs.
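As a rough illustration of the kind of distributional features mentioned above, the following Python sketch computes, for each word in a toy query log, a co-occurrence count and a co-occurrence entropy. The exact feature definitions are not given in the abstract, so everything here is an assumption made for illustration: the function name `cooccurrence_features`, the choice to count distinct words appearing in the same query, the use of Shannon entropy in bits over the co-occurrence distribution, and the toy log itself.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_features(queries):
    """Return {word: (distinct_cooccurring_words, cooccurrence_entropy_bits)}.

    Assumption: two words co-occur when they appear in the same query.
    Words that only ever appear alone in a query get no entry.
    """
    neighbours = defaultdict(Counter)
    for query in queries:
        words = set(query.lower().split())  # naive whitespace tokenization
        for word in words:
            for other in words:
                if other != word:
                    neighbours[word][other] += 1

    features = {}
    for word, counts in neighbours.items():
        total = sum(counts.values())
        # Shannon entropy of the distribution over co-occurring words.
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        features[word] = (len(counts), entropy)
    return features


# Hypothetical toy log: "hotels" attaches to many different topics,
# while each city name co-occurs with few other words.
toy_log = [
    "paris hotels",
    "london hotels",
    "tokyo hotels",
    "paris map",
    "paris map",
]
features = cooccurrence_features(toy_log)
for word, (count, entropy) in sorted(features.items()):
    print(f"{word}: count={count}, entropy={entropy:.3f}")
```

In this toy log, a word such as "hotels" that combines with many different topics receives a higher co-occurrence count and entropy than a word such as "london", which is the sort of signal the hypothesis relies on. A real query log would need proper tokenization and far more data before such values become meaningful.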
