Succinct and Informative Cluster Descriptions for Document Repositories

Large document repositories need to be organized, summarized, and labeled to be used effectively. Previous clustering studies focused on organization and paid little attention to producing cluster labels. Without informative labels, users must browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is therefore desirable to generate informative cluster descriptions (CDs) automatically, both to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels. This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality, or fidelity, of CDs and how to construct high-quality CDs. We propose using a CD-based classification to simulate how CDs are interpreted, and using the F-score of that classification to measure CD quality. Because searching directly for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit monotone quality behavior.
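
The idea of scoring CDs by the F-score of a CD-based classification can be illustrated with a minimal sketch. The abstract does not define the classifier or the averaging scheme, so everything below is an assumption made for illustration: a document is assigned to the cluster whose CD shares the most terms with it (ties and zero overlap leave it unassigned), and per-cluster F-scores are macro-averaged. The names `classify` and `macro_f_score` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumption: a document is a set of terms, and a CD is a small term set per cluster.
Doc = Tuple[Set[str], str]  # (terms in the document, true cluster label)


def classify(doc_terms: Set[str], cds: Dict[str, Set[str]]) -> Optional[str]:
    """Assign a document to the cluster whose CD shares the most terms with it.

    Returns None when no CD term occurs in the document or when the best
    overlap is tied between clusters (an assumed rule, not the paper's).
    """
    best_label: Optional[str] = None
    best_overlap = 0
    tie = False
    for label, cd in cds.items():
        overlap = len(doc_terms & cd)
        if overlap > best_overlap:
            best_label, best_overlap, tie = label, overlap, False
        elif overlap == best_overlap and overlap > 0:
            tie = True
    return None if tie else best_label


def macro_f_score(docs: List[Doc], cds: Dict[str, Set[str]]) -> float:
    """Macro-averaged F1 of the CD-based classification against true clusters."""
    predictions = [classify(terms, cds) for terms, _ in docs]
    scores = []
    for label in cds:
        tp = fp = fn = 0
        for (_, true_label), pred in zip(docs, predictions):
            if pred == label and true_label == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif true_label == label:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two clusters with two documents each.
docs: List[Doc] = [
    ({"stock", "market", "trade"}, "finance"),
    ({"bank", "loan", "market"}, "finance"),
    ({"match", "goal", "team"}, "sports"),
    ({"team", "coach", "market"}, "sports"),
]
cds: Dict[str, Set[str]] = {
    "finance": {"market", "bank"},
    "sports": {"team", "goal"},
}

score = macro_f_score(docs, cds)
```

In this toy setup the fourth document matches one term from each CD and is left unassigned, which lowers recall for the "sports" cluster. Overlapping CD terms of this kind are presumably what a disjointness factor would penalize, though the abstract does not give the CDD definition.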
