Informative Polythetic Hierarchical Ephemeral Clustering

Ephemeral clustering has been studied for more than a decade, although with low user acceptance. According to us, this situation is mainly due to (1) an excessive number of generated clusters, which makes browsing difficult and (2) low quality labeling, which introduces imprecision within the search process. In this paper, our motivation is twofold. First, we propose to reduce the number of clusters of Web page results, but keeping all different query meanings. For that purpose, we propose a new polythetic methodology based on an informative similarity measure, the InfoSimba, and a new hierarchical clustering algorithm, the HISGK-means. Second, a theoretical background is proposed to define meaningful cluster labels embedded in the definition of the HISGK-means algorithm, which may elect as best label, words outside the given cluster. To confirm our intuitions, we propose a new evaluation framework, which shows that we are able to extract most of the important query meanings but generating much less clusters than state-of-the-art systems.

[1]  Bruno Martins,et al.  Hierarchical Soft Clustering and Automatic Text Summarization for Accessing the Web on Mobile Devices for Visually Impaired People , 2009, FLAIRS Conference.

[2]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[3]  W. Bruce Croft,et al.  Generating hierarchical summaries for web searches , 2003, SIGIR '03.

[4]  Marti A. Hearst,et al.  Reexamining the cluster hypothesis: scatter/gather on retrieval results , 1996, SIGIR '96.

[5]  Dell Zhang,et al.  Semantic, Hierarchical, Online Clustering of Web Search Results , 2004, APWeb.

[6]  Claudio Carpineto,et al.  Exploiting the Potential of Concept Lattices for Information Retrieval with CREDO , 2004, J. Univers. Comput. Sci..

[7]  Raghu Krishnapuram,et al.  A clustering algorithm for asymmetrically related data with applications to text mining , 2001, CIKM '01.

[8]  Gaël Dias Information Digestion , 2010 .

[9]  Paolo Ferragina,et al.  A personalized search engine based on Web‐snippet hierarchical clustering , 2008, Softw. Pract. Exp..

[10]  Benjamin C. M. Fung,et al.  Hierarchical Document Clustering using Frequent Itemsets , 2003, SDM.

[11]  G. W. Milligan,et al.  An examination of procedures for determining the number of clusters in a data set , 1985 .

[12]  Nikos A. Vlassis,et al.  The global k-means clustering algorithm , 2003, Pattern Recognit..

[13]  Ricardo Campos,et al.  Automatic Hierarchical Clustering of Web Pages , 2005 .

[14]  Wei-Ying Ma,et al.  Learning to cluster web search results , 2004, SIGIR '04.

[15]  Dawid Weiss,et al.  Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition , 2004, Intelligent Information Systems.

[16]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[17]  Anupam Joshi,et al.  Retriever: Improving Web Search Engine Results Using Clustering , 2000 .

[18]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[19]  Israel Ben-Shaul,et al.  Ephemeral Document Clustering for Web Applications , 2001 .

[20]  Bruno Martins,et al.  Universal Mobile Information Retrieval , 2009, HCI.

[21]  Shourya Roy,et al.  A hierarchical monothetic document clustering algorithm for summarization and browsing search results , 2004, WWW '04.