Sparse Poisson Latent Block Model for Document Clustering

Over the last few decades, several studies have demonstrated the importance of co-clustering for simultaneously producing groups of objects and features. Even when only object clusters are sought, co-clustering is often more effective than one-way clustering, especially on sparse high-dimensional data. In this paper, we present a novel generative mixture model for co-clustering such data. This model, the Sparse Poisson Latent Block Model (SPLBM), is based on the Poisson distribution, which arises naturally for contingency tables such as document-term matrices. The advantages of SPLBM are twofold. First, it is a rigorous yet very parsimonious statistical model. Second, it has been designed from the ground up to deal with data sparsity. As a consequence, in addition to seeking homogeneous blocks, as other available algorithms do, it also filters out blocks that are homogeneous merely as an artifact of the sparsity of the data. Experiments on datasets of various sizes and structures show that an algorithm based on SPLBM clearly outperforms state-of-the-art algorithms. Most notably, the SPLBM-based algorithm presented here succeeds in retrieving the natural cluster structure of difficult, unbalanced datasets that other known algorithms are unable to handle effectively.
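To make the generative co-clustering idea concrete, the following is a minimal sketch of a plain Poisson latent block model fitted with hard (CEM-style) alternating updates on a document-term count matrix. It is an illustration under simplifying assumptions, not the sparsity-aware SPLBM algorithm of the paper: the function name, the deterministic round-robin initialization, and the absence of mixing proportions and noise-block filtering are all choices made here for brevity.

```python
import numpy as np

def poisson_block_coclustering(X, g, m, n_iter=30):
    """Hard (CEM-style) co-clustering under a plain Poisson block model.

    X : nonnegative count matrix (documents x terms), e.g. a
        document-term contingency table.
    g, m : numbers of row (document) and column (term) clusters.
    Returns (z, w): row and column cluster labels.

    NOTE: an illustrative sketch only -- a basic Poisson latent block
    model, not the sparsity-aware SPLBM described in the paper.
    """
    n, d = X.shape
    eps = 1e-10
    # Deterministic round-robin initialization for reproducibility.
    z = np.arange(n) % g
    w = np.arange(d) % m
    for _ in range(n_iter):
        R = np.eye(g)[z]                      # (n, g) row indicators
        C = np.eye(m)[w]                      # (d, m) column indicators
        # M-step: Poisson rate of each (row-cluster, column-cluster) block
        # = total count in the block / number of cells in the block.
        gamma = (R.T @ X @ C) / (np.outer(R.sum(0), C.sum(0)) + eps) + eps
        log_gamma = np.log(gamma)
        # C-step for rows: assign each row to the cluster maximizing its
        # Poisson log-likelihood (up to terms constant in the assignment).
        z = ((X @ C) @ log_gamma.T - gamma @ C.sum(0)).argmax(1)
        R = np.eye(g)[z]
        # C-step for columns, symmetrically.
        w = ((X.T @ R) @ log_gamma - R.sum(0) @ gamma).argmax(1)
    return z, w

# Toy matrix with two obvious diagonal blocks of high counts:
X = np.zeros((6, 6))
X[:3, :3] = 5
X[3:, 3:] = 5
z, w = poisson_block_coclustering(X, g=2, m=2)
# z groups rows {0,1,2} apart from {3,4,5}; w does the same for columns.
```

On this toy example the alternating updates recover the two diagonal blocks in a couple of iterations; on real sparse document-term matrices, the paper's point is precisely that such a vanilla model can be misled by blocks that look homogeneous only because they are almost entirely empty, which is what SPLBM's sparsity-aware design addresses.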
