Textual data summarization using the Self-Organized Co-Clustering model

Recent studies have demonstrated the usefulness of co-clustering, a data mining technique that simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model for summarizing textual data in document-term format. In addition to highlighting homogeneous co-clusters, as existing algorithms do, the model distinguishes noisy co-clusters from significant ones, which is particularly useful for sparse document-term matrices. Furthermore, the model imposes a structure among the significant co-clusters, which makes the results easier for users to interpret. The proposed approach is competitive with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, a probabilistic approach to co-clustering. A Stochastic Expectation-Maximization algorithm is proposed for inference, together with a model selection criterion for choosing the number of co-clusters. Experiments on both simulated and real data sets illustrate the efficiency of the model through its ability to identify relevant co-clusters.
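To make the Poisson block structure concrete, the following is a minimal sketch of an unconstrained Poisson Latent Block Model on a toy document-term matrix. It is not the constrained model proposed in this work: the function names, the toy rate matrix `gamma`, and the use of a low-rate third term cluster to stand in for noisy terms are all illustrative assumptions. Given fixed row and column partitions, it simulates counts, estimates each block's rate as the block mean, and evaluates the complete-data log-likelihood up to an additive constant.

```python
import numpy as np


def simulate_poisson_blocks(row_labels, col_labels, gamma, rng):
    """Draw counts x[i, j] ~ Poisson(gamma[z_i, w_j]) for given partitions."""
    rates = gamma[np.ix_(row_labels, col_labels)]
    return rng.poisson(rates)


def estimate_block_rates(x, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Maximum-likelihood block rates for fixed partitions: the mean count per block."""
    z = np.eye(n_row_clusters)[row_labels]  # one-hot row memberships
    w = np.eye(n_col_clusters)[col_labels]  # one-hot column memberships
    block_sums = z.T @ x @ w
    block_sizes = np.outer(z.sum(axis=0), w.sum(axis=0))
    return block_sums / block_sizes


def complete_log_likelihood(x, row_labels, col_labels, gamma):
    """Poisson log-likelihood for fixed partitions, omitting the log(x!) constant."""
    rates = gamma[np.ix_(row_labels, col_labels)]
    return float(np.sum(x * np.log(rates) - rates))


# Toy setting (assumed values): 2 document clusters, 3 term clusters.
# The third term cluster has a uniformly low rate, mimicking uninformative terms.
rng = np.random.default_rng(0)
row_labels = np.repeat([0, 1], 50)
col_labels = np.repeat([0, 1, 2], 20)
gamma = np.array([[5.0, 0.5, 0.1],
                  [0.5, 5.0, 0.1]])

x = simulate_poisson_blocks(row_labels, col_labels, gamma, rng)
gamma_hat = estimate_block_rates(x, row_labels, col_labels, 2, 3)
print(np.round(gamma_hat, 2))
print(complete_log_likelihood(x, row_labels, col_labels, gamma_hat))
```

In a full inference procedure the partitions are unknown; a stochastic EM scheme would alternate between sampling the row and column labels and re-estimating the block rates as above.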
