Paper information - Improving the presentation of search results by multipartite graph clustering of multiple reformulated queries and a novel document representation

The goal of clustering web search results is to reveal the semantics of the retrieved documents. The main challenge is to make the clustering partition relevant to the user’s query. In this paper, we describe a method of clustering search results using a similarity measure between documents retrieved by multiple reformulated queries. The method produces clusters of documents that are most relevant to the original query and, at the same time, represent a more diverse set of semantically related queries. To cluster thousands of documents in real time, we designed a novel multipartite graph clustering algorithm that has low polynomial complexity and no manually adjusted hyperparameters. The loss of semantics resulting from stem-based document representation is a common problem in information retrieval. To address this problem, we propose a novel alternative document representation, under which words are represented by their synonymy groups.
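The abstract does not spell out the similarity measure or how synonymy groups are built, so the following is only a minimal illustrative sketch under stated assumptions, not the paper's algorithm. It assumes a hypothetical hand-written synonym table (`SYNONYM_GROUP`) in place of a real lexical resource, and it uses Jaccard overlap of the sets of reformulated queries that retrieved each document as a stand-in similarity. All names here are hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy synonymy table: word -> synonymy-group id.
# An assumption for illustration; the paper's actual groups are not given here.
SYNONYM_GROUP: Dict[str, str] = {
    "car": "g_vehicle",
    "automobile": "g_vehicle",
    "auto": "g_vehicle",
    "buy": "g_purchase",
    "purchase": "g_purchase",
}


def to_synonymy_groups(text: str) -> Set[str]:
    """Represent a text by the synonymy groups of its words.

    Words missing from the table stand for themselves.
    """
    return {SYNONYM_GROUP.get(word, word) for word in text.lower().split()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_overlap_similarity(
    retrieved_by: Dict[str, Set[str]], doc_a: str, doc_b: str
) -> float:
    """Assumed stand-in similarity: two documents are similar when
    largely the same reformulated queries retrieved both of them."""
    return jaccard(retrieved_by.get(doc_a, set()), retrieved_by.get(doc_b, set()))
```

Under this sketch, "buy car" and "purchase automobile" map to the same set of groups even though they share no stems, which is the kind of semantic loss the synonymy-group representation is meant to avoid. A real system would presumably draw the groups from a lexical database rather than a hand-written table.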
