Learning a unified embedding space of web search from large-scale query log

Abstract: In a Web search, a user first forms an information need and issues a query guided by that need. The user then clicks some URLs and may issue further queries if those URLs do not satisfy the need. We posit that Web search is governed by a unified hidden space, and that each element involved, such as a query or a URL, has an inherent position in this space, i.e., it is projected as a vector. Each action in the search procedure, i.e., issuing a query or clicking a URL, results from interactions among those elements in the space. In this paper, we aim to uncover such a unified hidden space of Web search, one that uniformly captures the hidden semantics of search queries, URLs, and other elements involved in Web search. We learn the semantic space from search session data, because a search session can be regarded as an instantiation of a user's information need on a particular semantic topic, and it preserves the interactions between queries and URLs. We represent search sessions as a set of session graphs and cast space learning as a vector learning problem for the graph vertices, solved by maximizing the log-likelihood of a training set of sessions. Specifically, we adapt the well-known Word2vec model to perform the learning. Experiments on the query log of a commercial search engine examine the efficacy of the learnt vectors, and the results show that our framework is helpful for several finer-grained tasks in Web search.
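To make the idea of learning vertex vectors from session graphs concrete, the following is a minimal sketch in Python, not the method of the paper. The abstract does not specify the graph structure or the exact objective, so everything here is an assumption made for illustration: each session is a list of (query, clicked URLs) steps, edges link a query to its clicked URLs and to the next query in the session, and vectors are trained with a Word2vec-style skip-gram objective with negative sampling over those edges. The names `session_edges`, `train`, `cosine`, and `TOY_SESSIONS`, and the example queries and URLs, are hypothetical.

```python
import math
import random

# Hypothetical toy sessions: each step is (query, list of clicked URLs).
TOY_SESSIONS = [
    [("q:cheap flights", ["u:flights.example.com"]),
     ("q:flights to paris", ["u:flights.example.com", "u:paris.example.com"])],
    [("q:python tutorial", ["u:docs.example.org"]),
     ("q:python list sort", ["u:docs.example.org"])],
]

def session_edges(session):
    """Assumed graph: query-to-clicked-URL edges plus edges between consecutive queries."""
    edges = []
    for i, (query, clicks) in enumerate(session):
        for url in clicks:
            edges.append((query, url))
        if i + 1 < len(session):
            edges.append((query, session[i + 1][0]))
    return edges

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def train(sessions, dim=8, epochs=50, lr=0.05, negatives=2, seed=0):
    """Skip-gram with negative sampling over session-graph edges (illustrative only)."""
    rng = random.Random(seed)
    pairs = []
    for session in sessions:
        for a, b in session_edges(session):
            pairs.append((a, b))  # treat each edge as undirected
            pairs.append((b, a))
    vocab = sorted({v for pair in pairs for v in pair})
    vec = {v: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for v in vocab}
    ctx = {v: [0.0] * dim for v in vocab}
    for _ in range(epochs):
        rng.shuffle(pairs)
        for center, context in pairs:
            targets = [(context, 1.0)]
            targets += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
            grad = [0.0] * dim
            for target, label in targets:
                if label == 0.0 and target in (center, context):
                    continue  # skip negatives that collide with the positive pair
                g = lr * (label - sigmoid(dot(vec[center], ctx[target])))
                for k in range(dim):
                    grad[k] += g * ctx[target][k]
                    ctx[target][k] += g * vec[center][k]
            for k in range(dim):
                vec[center][k] += grad[k]
    return vec

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

vectors = train(TOY_SESSIONS)
```

Because queries and URLs share one vector table, a query can be compared directly with a URL or with another query, for example `cosine(vectors["q:cheap flights"], vectors["u:flights.example.com"])`. On a toy corpus this small the resulting similarities are not meaningful; the sketch only shows the shape of the computation.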
