Unsupervised Evaluation of Text Co-clustering Algorithms Using Neural Word Embeddings

Text clustering, which divides a dataset into groups of similar documents, plays an important role at various stages of the information retrieval process. Co-clustering extends one-sided clustering by simultaneously clustering the rows and columns of a data matrix. However, although co-clustering algorithms consider both dimensions of a document-term matrix, they are usually evaluated on the quality of the resulting document clusters alone. In this paper, we therefore propose an evaluation scheme that accounts for the two-dimensional nature of co-clustering algorithms, allowing for a more precise assessment of their performance. Another important benefit of the proposed approach is that it requires no prior labels. This is achieved by leveraging large, publicly available embedding matrices (GloVe, word2vec, FastText) to compute comparable representations of both document and term clusters. Experiments carried out on several textual datasets show that the proposed measures are both reliable and stable, and can even provide hints for improving co-clustering performance.
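The core idea of comparing document and term clusters in a shared embedding space can be sketched as follows. This is a minimal illustration, not the paper's actual measure: the toy embedding table, the mean-vector cluster representations, and the diagonal pairing of document and term clusters are all simplifying assumptions made for the example; in practice the vectors would come from a large pretrained matrix such as GloVe, word2vec, or FastText.

```python
import numpy as np

# Toy embedding table standing in for a large pretrained matrix
# (GloVe / word2vec / FastText); the vectors are illustrative only.
EMB = {
    "goal":  np.array([0.9, 0.1, 0.0]),
    "match": np.array([0.8, 0.2, 0.1]),
    "vote":  np.array([0.1, 0.9, 0.2]),
    "party": np.array([0.0, 0.8, 0.3]),
}

def embed_terms(terms):
    """Represent a term cluster by the mean embedding of its terms."""
    return np.mean([EMB[t] for t in terms], axis=0)

def embed_document(doc_terms):
    """Represent a document by the mean embedding of its terms."""
    return embed_terms(doc_terms)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def coclustering_score(doc_clusters, term_clusters):
    """Average cosine similarity between each document cluster's
    centroid and the centroid of its paired term cluster
    (assumes a one-to-one, diagonal co-cluster structure)."""
    sims = []
    for docs, terms in zip(doc_clusters, term_clusters):
        doc_centroid = np.mean([embed_document(d) for d in docs], axis=0)
        term_centroid = embed_terms(terms)
        sims.append(cosine(doc_centroid, term_centroid))
    return float(np.mean(sims))

# Two hypothetical co-clusters: "sports" and "politics".
doc_clusters = [
    [["goal", "match"], ["match", "goal"]],
    [["vote", "party"], ["party", "vote"]],
]
term_clusters = [["goal", "match"], ["vote", "party"]]
print(coclustering_score(doc_clusters, term_clusters))
```

Because both cluster types live in the same embedding space, their representations are directly comparable, which is what makes a label-free evaluation of both matrix dimensions possible.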