Sparse Tensor Co-clustering as a Tool for Document Categorization

To deal with document clustering, we usually rely on document-term matrices. However, from additional available information like keywords, co-authors, citations we might rather exploit a reorganization of the data in the form of a tensor. In this paper, we extend the use of the Sparse Poisson Latent Block Model to deal with sparse tensor data using jointly all information arising from documents. The proposed model is parsimonious and tailored for this kind of data. To estimate the parameters, we derive a suitable tensor co-clustering algorithm. Empirical results on several real-world text datasets highlight the advantages of our proposal which improves the clustering results of documents.

[1]  Maja Pantic,et al.  TensorLy: Tensor Learning in Python , 2016, J. Mach. Learn. Res..

[2]  L. Tucker,et al.  Some mathematical notes on three-mode factor analysis , 1966, Psychometrika.

[3]  Inderjit S. Dhillon,et al.  Information-theoretic co-clustering , 2003, KDD '03.

[4]  David Tse,et al.  Tensor Biclustering , 2017, NIPS.

[5]  Mohamed Nadif,et al.  Directional co-clustering , 2019, Adv. Data Anal. Classif..

[6]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[7]  Mohamed Nadif,et al.  Sparse Poisson Latent Block Model for Document Clustering , 2017, IEEE Transactions on Knowledge and Data Engineering.

[8]  Gérard Govaert,et al.  An EM algorithm for the block mixture model , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Tao Wu,et al.  General Tensor Spectral Co-clustering for Higher-Order Data , 2016, NIPS.

[10]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[11]  Mohamed Nadif,et al.  CoClust: A Python Package for Co-Clustering , 2019, Journal of Statistical Software.

[12]  Joydeep Ghosh,et al.  Model-based overlapping clustering , 2005, KDD '05.

[13]  D. Steinley Properties of the Hubert-Arabie adjusted Rand index. , 2004, Psychological methods.

[14]  R. Harshman,et al.  PARAFAC: parallel factor analysis , 1994 .

[15]  J. Pagès Multiple Factor Analysis by Example Using R , 2014 .