Semi-supervised Clustering for Short Text via Deep Representation Learning

In this work, we propose a semi-supervised method for short text clustering. We represent texts as distributed vectors with neural networks and use a small amount of labeled data to specify our intention for clustering. We design a novel objective that combines representation learning and k-means clustering, and optimize it on both labeled and unlabeled data by iterating three steps until convergence: (1) assign each short text to its nearest centroid based on its representation from the current neural networks; (2) re-estimate the cluster centroids from the assignments in step (1); (3) update the neural networks according to the objective while keeping the centroids and cluster assignments fixed. Experimental results on four datasets show that our method performs significantly better than several other text clustering methods.
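
The three-step loop described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the method as published: a single linear map `W` stands in for the neural networks, the objective is taken to be a plain k-means squared-distance loss (the abstract does not state the actual objective), and labeled texts are simply forced into the cluster matching their label. The names `train`, `nearest_centroid`, and all parameters are hypothetical.

```python
import numpy as np

def nearest_centroid(Z, C):
    # Squared Euclidean distance from every representation to every centroid.
    d = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def train(X, labels, k, dim=2, iters=20, lr=0.01, seed=0):
    """X: (n, d) text features; labels: (n,) ints in [0, k), or -1 if unlabeled."""
    rng = np.random.default_rng(seed)
    # Assumption: a linear encoder replaces the neural networks of the abstract.
    W = rng.normal(scale=0.1, size=(X.shape[1], dim))
    Z = X @ W
    # Initialize centroids from k randomly chosen representations.
    C = Z[rng.choice(len(X), size=k, replace=False)].copy()
    labeled = labels >= 0
    for _ in range(iters):
        Z = X @ W
        # Step 1: assign each text to its nearest centroid.
        assign = nearest_centroid(Z, C)
        # Assumption: labeled texts are pinned to the cluster of their label.
        assign[labeled] = labels[labeled]
        # Step 2: re-estimate centroids from the current assignments.
        for j in range(k):
            members = assign == j
            if members.any():
                C[j] = Z[members].mean(axis=0)
        # Step 3: with centroids and assignments fixed, take one gradient step
        # on W for the loss (1/n) * sum_i ||x_i W - c_{assign_i}||^2.
        grad = 2.0 * X.T @ (Z - C[assign]) / len(X)
        W -= lr * grad
    return W, C, assign
```

With only the squared-distance term, a linear encoder can shrink all representations toward a single point, so a practical objective would need a term that uses the labels to keep clusters apart; the sketch only shows how the three steps alternate.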
