TCUAP: A Novel Approach of Text Clustering Using Asymmetric Proximity

Text documents occupy sparse data spaces, and existing text clustering methods use symmetric proximity to measure the correlation between documents. In this paper, we propose a novel approach that uses asymmetric proximity for text clustering in order to strengthen the discriminative features of document objects. We present a measure of asymmetric proximity between documents and between clusters. TCUAP is an agglomerative hierarchical clustering algorithm that performs the cluster analysis using the strong components of a sparse matrix. An experimental evaluation on textual data sets demonstrates the validity and efficiency of our approach. The results show that the asymmetric proximity measure achieves higher accuracy than the symmetric one.
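
To make the idea concrete, the following is a minimal sketch, not the TCUAP algorithm itself. The abstract does not give the proximity formula, so `asymmetric_proximity` is a hypothetical stand-in (the fraction of one document's term occurrences that also occur in the other), and the `threshold` parameter and the single grouping pass are likewise assumptions made for illustration. The sketch shows only how an asymmetric measure yields a directed graph whose strong components can serve as clusters.

```python
from collections import Counter


def asymmetric_proximity(doc_a, doc_b):
    """Hypothetical measure: share of doc_a's term occurrences also found in doc_b.

    In general asymmetric_proximity(a, b) != asymmetric_proximity(b, a).
    """
    a, b = Counter(doc_a), Counter(doc_b)
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, b[term]) for term, count in a.items())
    return shared / total


def reachable(start, adjacency):
    """Return the set of nodes reachable from start (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strong_components(adjacency):
    """Group nodes that can reach each other; simple, not optimized for size."""
    reach = {node: reachable(node, adjacency) for node in adjacency}
    components, assigned = [], set()
    for node in adjacency:
        if node in assigned:
            continue
        comp = sorted(other for other in reach[node] if node in reach[other])
        components.append(comp)
        assigned.update(comp)
    return components


def cluster(docs, threshold):
    """Keep edge i -> j when the proximity of i to j meets the assumed threshold."""
    adjacency = {i: [] for i in range(len(docs))}
    for i, doc_i in enumerate(docs):
        for j, doc_j in enumerate(docs):
            if i != j and asymmetric_proximity(doc_i, doc_j) >= threshold:
                adjacency[i].append(j)
    return strong_components(adjacency)


docs = [
    ["text", "clustering", "proximity"],
    ["text", "clustering", "proximity", "asymmetric", "sparse"],
    ["matrix", "graph"],
    ["graph", "matrix"],
]
print(cluster(docs, threshold=0.5))
print(cluster(docs, threshold=0.8))
```

In this toy input, the first document is fully covered by the second, but not the reverse, so whether the two end up in the same strong component depends on the threshold. A symmetric measure could not express that distinction.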
