Accurate and efficient query clustering via top ranked search results

To make search more user-friendly, commercial search engines commonly build applications that provide suggestions or recommendations for every posed query. Clustering semantically similar queries is an essential prerequisite for these applications to work well. However, clustering queries effectively is challenging because they are usually short, incomplete, and ambiguous. Prevalent clustering methods such as K-Means and DBSCAN cannot guarantee good performance in such a high-dimensional setting. Hierarchical agglomerative clustering over users' click-through query logs gives good results but is computationally expensive. This paper identifies a novel feature for clustering search queries based on a key insight: the top-ranked search results of queries can themselves be used to quantify query similarity. Building on this feature, we propose a new similarity metric for comparing diverse queries, which enables us to develop two efficient and accurate query clustering algorithms. We conduct comprehensive experiments to compare the accuracy of our approach against known baselines along two dimensions: 1) quantifying the cohesion and separation of the clustered queries, and 2) validating the results with real-world Internet users. The experimental results demonstrate that our two algorithms and the similarity metric generate more accurate results in significantly less time.
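The key insight, that two queries are similar when their top-ranked search results overlap, can be illustrated with a minimal sketch. The abstract does not define the proposed metric or the two algorithms, so everything below is an assumption made for illustration: the Jaccard overlap stands in for the paper's similarity metric, the greedy single-pass grouping stands in for its clustering algorithms, and the names `topk_overlap`, `greedy_cluster`, `threshold`, and the example URLs are hypothetical.

```python
from typing import Dict, List, Sequence


def topk_overlap(results_a: Sequence[str], results_b: Sequence[str], k: int = 10) -> float:
    """Jaccard overlap of the top-k result URLs of two queries.

    Assumption: a stand-in for the paper's metric, which the abstract
    does not specify. This version ignores rank positions entirely.
    """
    a, b = set(results_a[:k]), set(results_b[:k])
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(
    top_results: Dict[str, List[str]], threshold: float = 0.3, k: int = 10
) -> List[List[str]]:
    """Assign each query to the first cluster whose representative
    (its first member) is similar enough; otherwise open a new cluster.

    Assumption: a simple illustrative scheme, not the paper's algorithms.
    """
    clusters: List[List[str]] = []
    for query, results in top_results.items():
        for cluster in clusters:
            representative = cluster[0]
            if topk_overlap(results, top_results[representative], k) >= threshold:
                cluster.append(query)
                break
        else:
            clusters.append([query])
    return clusters


# Hypothetical top-ranked result lists for three queries.
top_results = {
    "jaguar speed": ["a.example/1", "a.example/2", "a.example/3"],
    "how fast is a jaguar": ["a.example/2", "a.example/3", "a.example/4"],
    "jaguar car price": ["c.example/1", "c.example/2", "c.example/3"],
}

clusters = greedy_cluster(top_results)
```

In this toy data the first two queries share two of four distinct URLs (overlap 0.5) and are grouped together, while the third shares none and forms its own cluster. A rank-aware comparison of top-k lists would weight agreement near the top more heavily; the set-based overlap here is only the simplest choice.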
