Unsupervised Evaluation of Text Co-clustering Algorithms Using Neural Word Embeddings

Text clustering, which divides a dataset into groups of similar documents, plays an important role at various stages of the information retrieval process. Co-clustering extends one-sided clustering by simultaneously clustering the rows and columns of a data matrix. However, although co-clustering algorithms consider both dimensions of a document-term matrix, they are usually evaluated on the quality of the resulting document clusters alone. In this paper, we therefore propose an evaluation scheme that accounts for the two-dimensional nature of co-clustering algorithms, allowing for a more precise assessment of their performance. Another important benefit of the proposed approach is that it requires no prior labels. This is achieved by leveraging large, publicly available embedding matrices (GloVe, word2vec, FastText) to compute comparable representations of both document and term clusters. Experiments carried out on several textual datasets show that the proposed measures are both reliable and stable, and can even provide hints for improving co-clustering performance.
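The core idea of comparing document and term clusters in a shared embedding space can be sketched as follows. This is a minimal illustration, not the paper's actual measure: the toy embedding table, the mean-vector cluster representations, and the diagonal pairing of document and term clusters are all simplifying assumptions made for the example; in practice the vectors would come from a large pretrained matrix such as GloVe, word2vec, or FastText.

```python
import numpy as np

# Toy embedding table standing in for a large pretrained matrix
# (GloVe / word2vec / FastText); the vectors are illustrative only.
EMB = {
    "goal":  np.array([0.9, 0.1, 0.0]),
    "match": np.array([0.8, 0.2, 0.1]),
    "vote":  np.array([0.1, 0.9, 0.2]),
    "party": np.array([0.0, 0.8, 0.3]),
}

def embed_terms(terms):
    """Represent a term cluster by the mean embedding of its terms."""
    return np.mean([EMB[t] for t in terms], axis=0)

def embed_document(doc_terms):
    """Represent a document by the mean embedding of its terms."""
    return embed_terms(doc_terms)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def coclustering_score(doc_clusters, term_clusters):
    """Average cosine similarity between each document cluster's
    centroid and the centroid of its paired term cluster
    (assumes a one-to-one, diagonal co-cluster structure)."""
    sims = []
    for docs, terms in zip(doc_clusters, term_clusters):
        doc_centroid = np.mean([embed_document(d) for d in docs], axis=0)
        term_centroid = embed_terms(terms)
        sims.append(cosine(doc_centroid, term_centroid))
    return float(np.mean(sims))

# Two hypothetical co-clusters: "sports" and "politics".
doc_clusters = [
    [["goal", "match"], ["match", "goal"]],
    [["vote", "party"], ["party", "vote"]],
]
term_clusters = [["goal", "match"], ["vote", "party"]]
print(coclustering_score(doc_clusters, term_clusters))
```

Because both cluster types live in the same embedding space, their representations are directly comparable, which is what makes a label-free evaluation of both matrix dimensions possible.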