A vector reconstruction based clustering algorithm particularly for large-scale text collection

With the rapid development of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories can help users extract useful information from large-scale text collections. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms perform poorly on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper proposes a vector reconstruction based clustering algorithm in which only the features that can represent a cluster are preserved in that cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which feature weights are fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To speed up clustering, an intersection based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among clusters and features that are useless for representing a cluster are removed from its representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that the algorithm achieves high-quality results on both small-scale and large-scale text collections.
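The alternation between the two sub-processes can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the similarity is taken to be a dot product over shared features, the neuron adjustment is a plain SOM-style move toward the document, and the reallocation rule keeps each feature only in the cluster where its weight is largest. The function names, the `rate` parameter, and the seeding from the first `k` documents are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: feature -> weight


def intersection_similarity(doc: Vector, rep: Vector) -> float:
    """Assumed measure: dot product over the features both vectors share."""
    # Iterate over the smaller vector so only the intersection is touched.
    small, large = (doc, rep) if len(doc) <= len(rep) else (rep, doc)
    return sum(w * large[f] for f, w in small.items() if f in large)


def partial_tuning(docs: List[Vector], reps: List[Vector], rate: float) -> List[int]:
    """SOM-like step (assumed form): move the winning vector toward each document."""
    labels = []
    for doc in docs:
        winner = max(range(len(reps)),
                     key=lambda k: intersection_similarity(doc, reps[k]))
        rep = reps[winner]
        for f, w in doc.items():
            old = rep.get(f, 0.0)
            rep[f] = old + rate * (w - old)
        labels.append(winner)
    return labels


def overall_tuning(reps: List[Vector]) -> None:
    """Assumed reallocation rule: keep a feature only where its weight is largest."""
    best: Dict[str, int] = {}
    for k, rep in enumerate(reps):
        for f, w in rep.items():
            if f not in best or w > reps[best[f]][f]:
                best[f] = k
    for k, rep in enumerate(reps):
        for f in [f for f in rep if best[f] != k]:
            del rep[f]


def cluster(docs: List[Vector], k: int, rate: float = 0.5,
            max_iters: int = 20) -> Tuple[List[int], List[Vector]]:
    """Alternate the two sub-processes until the assignments stop changing."""
    reps = [dict(d) for d in docs[:k]]  # hypothetical seeding: first k documents
    labels: List[int] = []
    for _ in range(max_iters):
        new_labels = partial_tuning(docs, reps, rate)
        overall_tuning(reps)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, reps
```

In this sketch the similarity only ever visits features present in both vectors, which is one plausible reading of why an intersection based measure would speed up clustering on sparse, high-dimensional text vectors. After each overall tuning pass the representative vectors have disjoint feature sets, mirroring the idea that only features able to represent a cluster stay in its vector.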
