Clustering Documents with Large Overlap of Terms into Different Clusters based on Similarity Rough Set Model

The similarity rough set model for document clustering (SRSM) uses a generalized rough set model, based on a similarity relation and term co-occurrence, to group the documents of a collection into clusters. The model extends the tolerance rough set model (TRSM) (Ho and Funakoshi, 1997). SRSM methods have been evaluated, and the results showed that they perform better than TRSM. However, in document collections where the same words occur in different document classes, the improvement from SRSM is rather small. In this paper we propose a method to improve the performance of SRSM on such document collections.
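To make the role of term co-occurrence concrete, the following is a minimal sketch of the tolerance-class construction commonly associated with TRSM, the model that SRSM extends. It is an illustration under stated assumptions, not the method proposed in this paper: the function names `tolerance_classes` and `upper_approximation`, the threshold `theta`, and the toy documents are all hypothetical, and the sketch uses a symmetric co-occurrence relation, whereas the similarity relation used by SRSM is not specified in this abstract.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the set of terms it co-occurs with in >= theta documents.

    Assumption: `docs` is a list of term collections, and co-occurrence is
    counted once per document (TRSM-style, symmetric relation).
    """
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1

    # Every term belongs to its own class (reflexivity).
    classes = {t: {t} for t in vocab}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose class shares at least one term with `doc`."""
    doc_terms = set(doc)
    return {t for t, cls in classes.items() if cls & doc_terms}


# Hypothetical toy collection, for illustration only.
docs = [
    {"rough", "set", "cluster"},
    {"rough", "set", "model"},
    {"cluster", "document"},
]
classes = tolerance_classes(docs, theta=2)
enriched = upper_approximation({"rough", "document"}, classes)
```

In this toy collection only "rough" and "set" co-occur in two documents, so the upper approximation of a document containing "rough" also picks up "set". When many terms are shared across classes, such enrichment pulls in terms from unrelated classes as well, which is one plausible reading of why overlapping vocabularies limit the benefit described above.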

[1] George Karypis et al. A Comparison of Document Clustering Techniques, 2000.

[2] Inderjit S. Dhillon et al. Concept Decompositions for Large Sparse Text Data Using Clustering, 2004, Machine Learning.

[3] Nguyen Chi Thanh et al. A Similarity Rough Set Model for Document Representation and Document Clustering, 2011, J. Adv. Comput. Intell. Intell. Informatics.

[4] Soon Myoung Chung et al. Text document clustering based on frequent word meaning sequences, 2008, Data Knowl. Eng.

[5] George Karypis et al. Hierarchical Clustering Algorithms for Document Datasets, 2005, Data Mining and Knowledge Discovery.

[6] Hassan Abolhassani et al. Harmony K-means algorithm for document clustering, 2009, Data Mining and Knowledge Discovery.

[7] Qingcai Chen et al. A Tolerance Rough Set Based Semantic Clustering Method for Web Search Results, 2009.

[8] Michael McGill et al. Introduction to Modern Information Retrieval, 1983.

[9] R. Luce. Semiorders and a Theory of Utility Discrimination, 1956.

[10] Tsau Young Lin et al. A Review of Rough Set Models, 1997.

[11] Tu Bao Ho et al. Information Retrieval Using Rough Sets, 1998.

[12] R. Mooney et al. Impact of Similarity Measures on Web-page Clustering, 2000.

[13] Tu Bao Ho et al. Nonhierarchical document clustering based on a tolerance rough set model, 2002, Int. J. Intell. Syst.

[14] Alexis Tsoukiàs et al. Incomplete Information Tables and Rough Classification, 2001, Comput. Intell.