Paper Information - On Clustering Validity Measures and the Rough Set Theory

On Clustering Validity Measures and the Rough Set Theory

Document clustering has been investigated for use in different areas of text mining and information retrieval. A clustering depends on the chosen clustering algorithm as well as on that algorithm's parameter settings, so it is necessary to select the best result among several clustering techniques. However, evaluating a given clustering of documents is very difficult. Validity measures are external, internal, or relative. The disadvantage of external measures is that they require a human reference classification against which the clustering is evaluated. In this paper we propose the use of rough-set-based measures for document clustering evaluation, basing our calculations solely on the clustering to be evaluated. Two advantages of rough set theory are thus exploited: it needs no preliminary or additional information about the data, and it is a tool suited to computer applications in circumstances characterized by vagueness and uncertainty, as is the case for document clustering. We illustrate the use of the novel measures.
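The abstract does not define the proposed measures, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes a symmetric document similarity matrix (`sim`, made up for illustration) and a hypothetical similarity threshold, builds a similarity class for each document, and computes the classical rough-set lower and upper approximations of each cluster together with the accuracy of approximation (size of the lower approximation divided by size of the upper approximation). All function names are illustrative.

```python
from typing import List, Set, Tuple


def similarity_classes(sim: List[List[float]], threshold: float) -> List[Set[int]]:
    """For each document i, collect the documents whose similarity to i is >= threshold."""
    n = len(sim)
    return [{j for j in range(n) if sim[i][j] >= threshold} for i in range(n)]


def approximations(cluster: Set[int], classes: List[Set[int]]) -> Tuple[Set[int], Set[int]]:
    """Lower and upper approximations of a cluster under a symmetric similarity relation."""
    # Lower: documents whose whole similarity class lies inside the cluster.
    lower = {i for i, cls in enumerate(classes) if cls <= cluster}
    # Upper: documents whose similarity class overlaps the cluster.
    upper = {i for i, cls in enumerate(classes) if cls & cluster}
    return lower, upper


def accuracy(cluster: Set[int], classes: List[Set[int]]) -> float:
    """Accuracy of approximation: |lower| / |upper| (0.0 for an empty upper set)."""
    lower, upper = approximations(cluster, classes)
    return len(lower) / len(upper) if upper else 0.0


# Hypothetical symmetric similarity matrix for four documents (illustrative data only).
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.6, 0.1],
    [0.1, 0.6, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
# The clustering to be evaluated; no reference classification is involved.
clusters = [{0, 1}, {2, 3}]

classes = similarity_classes(sim, threshold=0.5)
for cluster in clusters:
    print(sorted(cluster), accuracy(cluster, classes))
```

An accuracy close to 1 means the cluster is nearly a union of similarity classes, while lower values indicate documents that are similar to members of more than one cluster. The threshold is a free parameter in this sketch; raising it to 0.7 for the data above separates the two clusters cleanly.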
