On Clustering Validity Measures and the Rough Set Theory

Document clustering has been investigated for use in different areas of text mining and information retrieval. The result of a clustering depends on the chosen algorithm as well as on that algorithm's parameter settings, so it is necessary to compare several clustering techniques and select the best one. Evaluating a given clustering of documents is, however, very difficult. Validity measures are external, internal, or relative. External measures have the disadvantage that they require a human reference classification against which the clustering is evaluated. In this paper we propose rough-set-based measures for document clustering evaluation whose calculations rely solely on the clustering to be evaluated. We thereby exploit two advantages of rough set theory: it needs no preliminary or additional information about the data, and it is a tool suited to computer applications in circumstances characterized by vagueness and uncertainty, as is the case in document clustering. We illustrate the use of the novel measures.
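The abstract does not define the proposed measures, so the following is only a minimal sketch of the standard rough set notions such measures could build on: the lower and upper approximation of a cluster under a similarity relation, and the classical accuracy of approximation (size of the lower approximation divided by size of the upper approximation). The function names, the `related` mapping, and the toy document identifiers are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Hashable, Set, Tuple


def approximations(
    cluster: Set[Hashable], related: Dict[Hashable, Set[Hashable]]
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return the (lower, upper) rough approximations of `cluster`.

    `related[x]` is the set of objects regarded as similar to x
    (assumed to contain x itself). How similarity is derived from the
    documents is left open here; it is a hypothetical input.
    """
    # Lower approximation: objects whose whole similarity class lies in the cluster.
    lower = {x for x, cls in related.items() if cls <= cluster}
    # Upper approximation: objects whose similarity class overlaps the cluster.
    upper = {x for x, cls in related.items() if cls & cluster}
    return lower, upper


def accuracy(cluster: Set[Hashable], related: Dict[Hashable, Set[Hashable]]) -> float:
    """Classical accuracy of approximation: |lower| / |upper|."""
    lower, upper = approximations(cluster, related)
    return len(lower) / len(upper) if upper else 1.0


# Toy data (hypothetical): five documents and their similarity classes.
related = {
    "d1": {"d1", "d2"},
    "d2": {"d1", "d2"},
    "d3": {"d3", "d4"},
    "d4": {"d3", "d4", "d5"},
    "d5": {"d4", "d5"},
}
cluster_a = {"d1", "d2", "d3"}
print(approximations(cluster_a, related))
print(accuracy(cluster_a, related))
```

An accuracy close to 1 would indicate that the cluster is described almost crisply by the similarity relation, while a low value indicates a large boundary region. The sketch needs only the clustering and the similarity information, with no human reference classification.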
