Improving the presentation of search results by multipartite graph clustering of multiple reformulated queries and a novel document representation

The goal of clustering web search results is to reveal the semantics of the retrieved documents. The main challenge is to make the clustering partition relevant to the user's query. In this paper, we describe a method of clustering search results that uses a similarity measure between documents retrieved by multiple reformulated queries. The method produces clusters of documents that are most relevant to the original query and, at the same time, represent a more diverse set of semantically related queries. To cluster thousands of documents in real time, we designed a novel multipartite graph clustering algorithm that has low polynomial complexity and no manually adjusted hyperparameters. The loss of semantics caused by stem-based document representation is a common problem in information retrieval. To address it, we propose an alternative document representation in which words are represented by their synonymy groups.
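The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the similarity measure, so Jaccard overlap is used here as a stand-in, and the synonym table `GROUP_OF`, the retrieval map `RETRIEVED_BY`, and both function names are hypothetical toy values chosen for illustration.

```python
def to_synonymy_groups(tokens, group_of):
    """Replace each token with its synonymy-group id; unknown tokens are kept as-is."""
    return {group_of.get(token, token) for token in tokens}


def jaccard(a, b):
    """Jaccard overlap of two sets (a stand-in for the paper's similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical synonym table: every word in a group maps to the same group id.
GROUP_OF = {
    "car": "g_vehicle", "automobile": "g_vehicle",
    "buy": "g_purchase", "purchase": "g_purchase",
}

doc_a = "buy car".split()
doc_b = "purchase automobile".split()

# Word-level overlap misses the shared meaning entirely.
raw_similarity = jaccard(set(doc_a), set(doc_b))

# After mapping words to synonymy groups, the two documents coincide.
grouped_similarity = jaccard(
    to_synonymy_groups(doc_a, GROUP_OF),
    to_synonymy_groups(doc_b, GROUP_OF),
)

# Hypothetical map from document id to the reformulated queries that retrieved it.
RETRIEVED_BY = {"d1": {"q0", "q1"}, "d2": {"q1", "q2"}, "d3": {"q3"}}

# Documents retrieved by overlapping sets of queries are treated as similar.
query_similarity = jaccard(RETRIEVED_BY["d1"], RETRIEVED_BY["d2"])

print(raw_similarity, grouped_similarity, query_similarity)
```

In this toy setting the word-level comparison finds no overlap between the two documents, while the synonymy-group comparison treats them as identical, which is the loss of semantics the alternative representation is meant to avoid.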
