A new descriptive clustering algorithm based on Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) provides a way for finding a part-based representation of nonnegative data. An important property of NMF is that it can produce a sparse representation of the data; however, in some applications, especially in text clustering, the sparse representation always consists of separated words, which cannot explicitly express the meaning of the basis vector. This paper presents a new descriptive clustering algorithm based on NMF, called DC-NMF that can avoid this separated word problem. In our proposed method, we embrace the phrase-by-document matrix in addition to the commonly used term-by-document matrix. Then, we use conjunct gradient descent to minimize the mean squared error objective function. Finally, we describe each cluster with the highest weighted element corresponding to one particular phrase. Our experimental results indicate that our method can obtain more ldquopurerdquo clusters than other methods.