Word2Cluster: A New Multi-Label Text Clustering Algorithm with an Adaptive Clusters Number

Text clustering is widely used in many Natural Language Processing (NLP) applications, such as text summarization and news recommendation. However, most existing algorithms require the number of clusters to be specified in advance, which is difficult to determine. Moreover, multi-label clustering is useful in many applications, but related work remains scarce. Although several studies have addressed each of these two problems, methods that solve both simultaneously are still needed. We therefore propose a new text clustering algorithm called Word2Cluster, which automatically determines an adaptive number of clusters and supports multi-label clustering. To test the performance of Word2Cluster, we build a Chinese text dataset, Hotline, drawn from real-world applications. To evaluate clustering results more fairly, we also propose an improved evaluation method based on accuracy, precision, and recall for multi-label text clustering. Experimental results on the Chinese Hotline dataset and the public English Reuters dataset demonstrate that our algorithm achieves a better F1-measure and runs faster than state-of-the-art baselines.
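The abstract does not spell out the improved evaluation method, so the following is only a minimal sketch of the standard example-based precision, recall, and F1 measures that such multi-label evaluations typically build on; the function name `multilabel_prf` and the toy label sets are illustrative assumptions, not the paper's definitions.

```python
from typing import List, Set, Tuple

def multilabel_prf(true_labels: List[Set[str]],
                   pred_labels: List[Set[str]]) -> Tuple[float, float, float]:
    """Example-based precision, recall, and F1 for multi-label assignments.

    Each document i has a set of true cluster labels and a set of
    predicted cluster labels; per-document scores are averaged.
    """
    precisions, recalls = [], []
    for true, pred in zip(true_labels, pred_labels):
        overlap = len(true & pred)
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(true) if true else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Hypothetical example: three documents with multi-label cluster assignments.
truth = [{"traffic"}, {"housing", "noise"}, {"sanitation"}]
preds = [{"traffic"}, {"housing"}, {"sanitation", "noise"}]
print(multilabel_prf(truth, preds))
```

Under this example-based view, a document counts partially toward precision when its predicted clusters include spurious labels and partially toward recall when true labels are missed, which is one common way to score multi-label clusterings against ground truth.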
