VRCA: A Clustering Algorithm for Massive Amount of Texts

A large volume of text appears on the web every day, and the total amount of web text is growing explosively. Handling large-scale text collections is therefore increasingly important. Clustering is a widely accepted approach to organizing text: because it is unsupervised, users can extract the information they need without labeled data. However, traditional clustering algorithms can only handle small-scale text collections; as a collection grows, their performance degrades. The main cause is the high-dimensional vectors generated from texts. To cluster large volumes of text, this paper proposes a novel clustering algorithm in which a cluster's vector preserves only the features that can represent the cluster. The clustering process is separated into two parts. In one part, feature weights are fine-tuned so that the cluster partition meets an optimization function. In the other part, features are reordered and only the useful features that can represent the cluster are kept in the cluster's vector. Experimental results demonstrate that our algorithm obtains high performance on both small-scale and large-scale text collections.
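The two alternating parts described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not state the optimization function or the criterion used to order features, so the sketch assumes a spherical k-means style assignment (dot product of unit-length vectors) in place of the weight fine-tuning step, and ranks features by their summed weight within a cluster. The names `truncated_centroid`, `cluster`, `keep`, and `seeds` are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def truncated_centroid(docs, keep):
    """Sum member vectors, order features by weight, keep only the top `keep`.

    Assumption: summed weight stands in for the paper's (unspecified)
    measure of how well a feature represents the cluster.
    """
    total = defaultdict(float)
    for doc in docs:
        for term, w in doc.items():
            total[term] += w
    top = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
    return normalize(dict(top))


def cluster(docs, seeds, keep=3, iterations=10):
    """Alternate between assigning documents and rebuilding short cluster vectors."""
    docs = [normalize(d) for d in docs]
    centroids = [truncated_centroid([docs[i]], keep) for i in seeds]
    labels = None
    for _ in range(iterations):
        # Part 1 (stand-in): assign each document to its most similar cluster.
        new_labels = [
            max(range(len(centroids)), key=lambda c: dot(d, centroids[c]))
            for d in docs
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Part 2: reorder features and keep only the top ones per cluster.
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = truncated_centroid(members, keep)
    return labels, centroids


docs = [
    {"cat": 2, "pet": 1},
    {"cat": 1, "dog": 1, "pet": 1},
    {"stock": 2, "market": 1},
    {"market": 2, "bank": 1},
]
labels, centroids = cluster(docs, seeds=[0, 2], keep=2)
```

Because each cluster vector holds at most `keep` features, the cost of comparing a document with a cluster depends on `keep` rather than on the full vocabulary size, which is the property the abstract relies on for large collections.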
