A Text Clustering System based on k-means Type Subspace Clustering and Ontology

This paper presents a text clustering system developed based on a k-means type subspace clustering algorithm to cluster large, high dimensional and sparse text data. In this algorithm, a new step is added in the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by the weight values. For understanding and interpretation of clustering results, a few keywords that can best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to their weights calculated by our new algorithm. Then, the candidates are fed to the WordNet to identify the set of noun words and consolidate the synonymy and hyponymy words. Experimental results have shown that the clustering algorithm is superior to the other subspace clustering algorithms, such as PROCLUS and HARP and kmeans type algorithm, e.g., Bisecting-KMeans. Furthermore, the word extraction method is effective in selection of the words to represent the topics of the clusters. Keywords—Subspace Clustering, Text Mining, Feature Weighting, Cluster Interpretation, Ontology

[1]  C.-C. Jay Kuo,et al.  A new initialization technique for generalized Lloyd iteration , 1994, IEEE Signal Processing Letters.

[2]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[3]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[4]  Hichem Frigui,et al.  Unsupervised learning of prototypes and attribute weights , 2004, Pattern Recognit..

[5]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[6]  Steffen Staab,et al.  Ontology-based Text Document Clustering , 2002, Künstliche Intell..

[7]  Wentian Li,et al.  Random texts exhibit Zipf's-law-like word frequency distribution , 1992, IEEE Trans. Inf. Theory.

[8]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[9]  Ran El-Yaniv,et al.  Distributional Word Clusters vs. Words for Text Categorization , 2003, J. Mach. Learn. Res..

[10]  Shokri Z. Selim,et al.  K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  George Karypis,et al.  Comparison of Agglomerative and Partitional Document Clustering Algorithms , 2002 .

[12]  Ivan Herman,et al.  Graph Visualization and Navigation in Information Visualization: A Survey , 2000, IEEE Trans. Vis. Comput. Graph..

[13]  D. Swanson Medical literature as a potential source of new knowledge. , 1990, Bulletin of the Medical Library Association.

[14]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[15]  Michael K. Ng,et al.  HARP: a practical projected clustering algorithm , 2004, IEEE Transactions on Knowledge and Data Engineering.

[16]  Anthony K. H. Tung,et al.  Spatial clustering methods in data mining : A survey , 2001 .

[17]  Shi Zhong,et al.  A Comparative Study of Generative Models for Document Clustering , 2003 .

[18]  Paul Harris,et al.  Sampling and Statistics , 1993 .

[19]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[20]  Weiguo Fan,et al.  Tapping the power of text mining , 2006, CACM.

[21]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[22]  James C. Bezdek,et al.  A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms , 1980, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[24]  Michael K. Ng,et al.  An optimization algorithm for clustering using weighted dissimilarity measures , 2004, Pattern Recognit..

[25]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..

[26]  Inderjit S. Dhillon,et al.  Efficient Clustering of Very Large Document Collections , 2001 .

[27]  Michael K. Ng,et al.  Subspace Clustering of Text Documents with Feature Weighting K-Means Algorithm , 2005, PAKDD.

[28]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.