A hierarchical monothetic document clustering algorithm for summarization and browsing search results

Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm that builds a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes coverage while maintaining the distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, and the ultimate test is user judgment. We use several objective measures, such as coverage and reach time, to compare the proposed algorithm empirically with two other monothetic clustering algorithms and demonstrate its superiority. Although our algorithm is slightly more computationally intensive than one of these algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
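
To make the idea of progressively choosing topics that maximize coverage while staying distinct more concrete, the following is a minimal sketch of one level of a monothetic clustering pass. It is not the DisCover algorithm itself, whose scoring details are not given in this abstract. It assumes that each cluster is defined by a single term (the monothetic property), that coverage is the fraction of documents containing at least one chosen term, and that distinctiveness can be approximated by rewarding only documents that are not yet covered. The names `greedy_topics` and `coverage` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_topics(docs: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k topic terms; a term's cluster is every document containing it."""
    # Invert the collection: term -> ids of the documents that contain the term.
    postings: Dict[str, Set[str]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_term, best_gain = None, 0
        for term in sorted(postings):  # sorted so that ties break deterministically
            if term in chosen:
                continue
            # Assumed proxy for "coverage with distinctiveness":
            # count only documents that no earlier topic already covers.
            gain = len(postings[term] - covered)
            if gain > best_gain:
                best_term, best_gain = term, gain
        if best_term is None:  # no remaining term adds any new document
            break
        chosen.append(best_term)
        covered |= postings[best_term]
    return chosen


def coverage(docs: Dict[str, Set[str]], topics: List[str]) -> float:
    """Fraction of documents that contain at least one of the chosen topic terms."""
    if not docs:
        return 0.0
    topic_set = set(topics)
    return sum(1 for terms in docs.values() if terms & topic_set) / len(docs)


if __name__ == "__main__":
    # Toy search results for an ambiguous query, with the query term removed.
    results = {
        "d1": {"car"},
        "d2": {"car", "dealer"},
        "d3": {"cat"},
        "d4": {"cat", "habitat"},
        "d5": {"os"},
    }
    topics = greedy_topics(results, k=2)
    print(topics, coverage(results, topics))
```

Applying the same selection recursively to the documents under each chosen term would yield sub-topics, which is one way to obtain a hierarchy under these assumptions.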
