Behavior-driven clustering of queries into topics

Categorizing web-search queries into semantically coherent topics is crucial for understanding the interest trends of search engine users and, therefore, for providing more intelligent personalization services. Query clustering usually relies on lexical and clickthrough data, while the information carried by users' actions when submitting their queries is currently neglected. In particular, the intent that drives users to submit their requests is an important element for the meaningful aggregation of queries. We propose a new intent-centric notion of topical query clusters and define a query clustering technique that differs from existing algorithms in both its methodology and the nature of the resulting clusters. Our method extracts topics from the query log by merging missions, i.e., activity fragments that express a coherent user intent, on the basis of their topical affinity. Our approach works bottom-up, without any a priori knowledge of topical categorization, and produces topics of good quality compared with state-of-the-art clustering techniques. It can also summarize topically coherent missions that occur far apart from each other, thus enabling more compact user profiling on a topical basis. Furthermore, such topical user profiling distinguishes the activity stream of a particular user from the activity of others, with the potential to predict future user search activity.
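
To make the bottom-up idea concrete, the following is a minimal sketch of merging missions into topics by pairwise affinity. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify the affinity measure, so Jaccard overlap of query terms is used as a stand-in, and the names `mission_terms`, `merge_missions`, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term sets (assumed stand-in for topical affinity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mission_terms(mission):
    """Set of lowercase terms appearing in any query of a mission."""
    return {term for query in mission for term in query.lower().split()}


def merge_missions(missions, threshold=0.2):
    """Greedy bottom-up merging (hypothetical helper).

    Each mission starts as its own cluster. The most similar pair of
    clusters is merged repeatedly until no pair reaches `threshold`.
    """
    clusters = [{"missions": [m], "terms": mission_terms(m)} for m in missions]
    while len(clusters) > 1:
        best_score, best_pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = jaccard(clusters[i]["terms"], clusters[j]["terms"])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_pair is None or best_score < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        merged = {
            "missions": clusters[i]["missions"] + clusters[j]["missions"],
            "terms": clusters[i]["terms"] | clusters[j]["terms"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy missions: each mission is a list of queries sharing one intent.
example_missions = [
    ["cheap flights rome", "rome hotels"],
    ["python list sort"],
    ["flights to rome"],
    ["sort dict python"],
]
topics = merge_missions(example_missions)
```

In this toy input, the two travel missions and the two programming missions are not adjacent, yet they can still end up in the same cluster, which mirrors the point that topically coherent missions occurring far apart can be summarized together. A realistic affinity measure would likely draw on more than term overlap.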
