Search Engine Query Clustering Using Top-k Search Results

Clustering of search engine queries has attracted significant attention in recent years. Many search engine applications, such as query recommendation, require query clustering as a prerequisite, and clustering is necessary to unlock the full value of query logs. However, clustering search queries effectively is challenging because user input is highly diverse and unconstrained. Search queries are usually short and ambiguous about what the user wants. Many different queries may refer to a single concept, while a single query may cover many concepts. Prevalent clustering methods such as K-Means and DBSCAN cannot guarantee good results in such a diverse environment. Agglomerative clustering gives good results but is computationally expensive. This paper presents a novel clustering approach based on a key insight: search engine results can themselves be used to identify query similarity. We propose a novel similarity metric for diverse queries based on the ranked URL results that a search engine returns for each query. We use this metric to develop an efficient and accurate algorithm for clustering queries. Our experimental results show that our approach achieves more accurate clustering, better scalability, and greater robustness than known baselines.
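The abstract does not specify the proposed metric or algorithm, so the following is only a minimal sketch of the underlying idea: comparing two queries through their ranked top-k URL lists. The rank-weighted overlap in `topk_similarity`, the greedy single-pass grouping in `cluster_queries`, the `threshold` value, and the example URLs are all assumptions chosen for illustration, not the method described in the paper.

```python
from typing import Dict, List


def topk_similarity(urls_a: List[str], urls_b: List[str]) -> float:
    """Rank-weighted overlap of two ranked URL lists, in [0, 1].

    Illustrative stand-in, NOT the paper's metric. A URL at 0-based
    rank r gets weight 1 / (r + 1); each shared URL contributes the
    smaller of its two weights. Assumes URLs are unique within a list.
    """
    weight_a = {url: 1.0 / (rank + 1) for rank, url in enumerate(urls_a)}
    weight_b = {url: 1.0 / (rank + 1) for rank, url in enumerate(urls_b)}
    shared = sum(min(weight_a[u], weight_b[u])
                 for u in weight_a.keys() & weight_b.keys())
    total = max(sum(weight_a.values()), sum(weight_b.values()))
    return shared / total if total else 0.0


def cluster_queries(results: Dict[str, List[str]],
                    threshold: float = 0.5) -> List[List[str]]:
    """Greedy single-pass grouping (hypothetical, for illustration).

    Each query joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for query, urls in results.items():
        for cluster in clusters:
            representative = cluster[0]
            if topk_similarity(urls, results[representative]) >= threshold:
                cluster.append(query)
                break
        else:
            clusters.append([query])
    return clusters


# Hypothetical top-3 results for an ambiguous term.
results = {
    "jaguar car": ["jaguar.example", "cars.example/jaguar",
                   "wiki.example/Jaguar_Cars"],
    "jaguar xf price": ["jaguar.example", "wiki.example/Jaguar_Cars",
                        "dealer.example/xf"],
    "jaguar animal": ["wiki.example/Jaguar", "zoo.example/jaguar",
                      "nature.example/big-cats"],
}
clusters = cluster_queries(results)
```

In this toy input the two car-related queries share highly ranked URLs and fall into one group, while the animal query shares none and forms its own. This reflects the abstract's point that different query strings can refer to one concept and the same term can cover several.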
