A Novel Method for Clustering Web Search Results with Wikipedia Disambiguation Pages

Organizing the search results of an ambiguous query into topics can facilitate information search on the Web. In this paper, we propose a novel method that clusters the search results of an ambiguous query into topics about the query, where the topics are constructed from Wikipedia disambiguation pages (WDP). To improve the clustering result, we propose a concept filtering method that removes semantically unrelated concepts from each topic. We also propose the top K full relations (TKFR) algorithm, which assigns search results to relevant topics based on the similarities between the concepts in the results and those in the topics. Compared with clustering methods whose topic labels are extracted from the search results, the human-edited topics of WDP are much more helpful for navigation. The experimental results show that our method works for ambiguous queries of different lengths and substantially improves on the clustering results of an existing WDP-based method.
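The abstract does not spell out how TKFR computes its scores, so the following is only a minimal sketch of similarity-based topic assignment under stated assumptions, not the authors' algorithm. It assumes each concept has a vector representation, scores a result against a topic by averaging the K largest pairwise cosine similarities between their concepts, and assigns the result to the best-scoring topic. The function names, the toy vectors, and the averaging rule are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_score(result_concepts, topic_concepts, vectors, k=3):
    """Average of the k largest result-concept/topic-concept similarities.

    Assumption: this averaging rule is illustrative only; the abstract
    does not define how TKFR combines similarities.
    """
    sims = sorted(
        (cosine(vectors[r], vectors[t])
         for r in result_concepts
         for t in topic_concepts),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


def assign_topic(result_concepts, topics, vectors, k=3):
    """Return the name of the topic with the highest top-k score."""
    scores = {
        name: top_k_score(result_concepts, concepts, vectors, k)
        for name, concepts in topics.items()
    }
    return max(scores, key=scores.get)


# Toy 2-D concept vectors (hypothetical data, for illustration only).
vectors = {
    "feline": (1.0, 0.0),
    "wildlife": (0.9, 0.1),
    "car": (0.0, 1.0),
    "engine": (0.1, 0.9),
}

# Topics as they might be derived from a disambiguation page for "jaguar".
topics = {
    "Jaguar (animal)": ["feline", "wildlife"],
    "Jaguar (car maker)": ["car", "engine"],
}

print(assign_topic(["engine", "car"], topics, vectors, k=2))
```

In this sketch, filtering unrelated concepts from a topic would amount to dropping entries from the lists in `topics` before scoring.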
