Regularized Dual-PPMI Co-clustering for Text Data

Co-clustering of document-term matrices has proved to be more effective than one-sided clustering. Text data are also, by nature, generally unbalanced and directional. Recently, the von Mises-Fisher (vMF) mixture model was proposed to handle unbalanced data while exploiting the directional nature of text. In this paper, we propose a novel co-clustering approach based on a matrix formulation of vMF model-based co-clustering. This formulation leads to a flexible method for text co-clustering that can easily incorporate both word-word semantic relationships and document-document similarities. In contrast to existing methods, which generally incorporate similarities additively, we propose a dual multiplicative regularization that better captures the underlying structure of text data. Extensive evaluations on various real-world text datasets demonstrate that the proposed approach outperforms baseline and competitive methods, in terms of both clustering results and co-cluster topic coherence.
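As a rough illustration of the idea of a dual multiplicative regularization, the following Python sketch builds positive pointwise mutual information (PPMI) matrices on both the word side and the document side and multiplies the document-term matrix by them before projecting rows onto the unit hypersphere, as directional (vMF-style) models expect. This is a minimal sketch under stated assumptions, not the formulation used in the paper: the helper names (`ppmi`, `unit_rows`, `dual_regularize`), the use of `X.T @ X` and `X @ X.T` as co-occurrence counts, and the added identity matrices are all illustrative choices.

```python
import numpy as np

def ppmi(cooc):
    """Positive PMI of a square co-occurrence matrix (zero counts map to 0)."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0   # log(0) and 0/0 cases carry no association
    return np.maximum(pmi, 0.0)    # keep only positive associations

def unit_rows(m):
    """L2-normalize each row so it lies on the unit hypersphere."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def dual_regularize(X):
    """Multiply X by document-side and word-side PPMI matrices (assumed form)."""
    X = np.asarray(X, dtype=float)
    # Assumption: co-occurrence counts are approximated by X^T X and X X^T.
    M_w = ppmi(X.T @ X) + np.eye(X.shape[1])   # word-word similarities
    M_d = ppmi(X @ X.T) + np.eye(X.shape[0])   # document-document similarities
    # Assumption: the identity terms keep the original counts in the product.
    return unit_rows(M_d @ X @ M_w)

# Toy document-term matrix with two obvious blocks (docs 0-1 / words 0-1, etc.).
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 3],
              [0, 0, 2, 1]])
Z = dual_regularize(X)
```

In this toy case the regularized matrix `Z` keeps the block structure of `X`, and its unit-norm rows could then be passed to any directional clustering or co-clustering routine.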
