A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Text clustering is an effective way to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, open issues remain, such as feature extraction and dimensionality reduction. To address them, we present a novel approach named the deep-learning vocabulary network. The vocabulary network is constructed from a related-word set, which captures the co-occurrence relations between words or terms. We replace the term frequency in feature vectors with the "importance" of each word, computed from the vocabulary network with PageRank, which yields more precise feature vectors for representing the meaning of the text. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of the feature vectors, and we introduce a coverage rate as the similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and the experimental results show that feature vectors built with the deep-learning vocabulary network achieve better clustering performance.
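
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the related-word set, the coverage rate, or the sparse-group deep belief network in detail, so the sketch makes the following assumptions. Two words are treated as related when they occur in the same document; PageRank is run as a plain weighted power iteration with a damping factor of 0.85; cosine similarity stands in for the coverage rate in the single-pass step; and the dimensionality-reduction stage is omitted. All function names and the similarity threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math


def build_cooccurrence(docs):
    """Undirected weighted graph; edge weight = number of documents
    in which the two words co-occur (assumed notion of 'related')."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def pagerank(graph, damping=0.85, iters=50):
    """Weighted PageRank by power iteration over the word graph."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {w: 1.0 / n for w in nodes}
    strength = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iters):
        new_rank = {}
        for w in nodes:
            incoming = sum(rank[v] * graph[v][w] / strength[v] for v in graph[w])
            new_rank[w] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def feature_vector(tokens, rank):
    """Sparse vector whose weights are PageRank scores instead of term frequencies."""
    return {w: rank.get(w, 0.0) for w in set(tokens)}


def cosine(u, v):
    """Cosine similarity of two sparse vectors (stand-in for the coverage rate)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold=0.3):
    """Assign each vector to the most similar existing cluster,
    or open a new cluster when no similarity reaches the threshold."""
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            for w, x in vec.items():  # centroid kept as the sum of member vectors
                centroids[best][w] = centroids[best].get(w, 0.0) + x
            labels.append(best)
        else:
            centroids.append(dict(vec))
            labels.append(len(centroids) - 1)
    return labels


# Toy corpus (invented for illustration only).
docs = [
    ["apple", "banana", "fruit"],
    ["banana", "fruit", "juice"],
    ["engine", "car", "wheel"],
    ["car", "wheel", "road"],
]
rank = pagerank(build_cooccurrence(docs))
labels = single_pass([feature_vector(d, rank) for d in docs])
print(labels)
```

In this toy corpus, words that co-occur with many others (such as "banana") receive a higher importance than peripheral words (such as "apple"), which is the effect the PageRank-based weighting is meant to capture. Because single-pass clustering processes documents in order, its result can depend on both the threshold and the document order.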
