Co-clustering for Dual Topic Models

Biclustering is a data mining method that simultaneously clusters the two dimensions of a matrix, its rows and its columns. A bicluster typically corresponds to a sub-matrix that exhibits some coherent tendency. A traditional biclustering task for categorical variables is to find heavy sub-graphs, which correspond to significant biclusters, i.e., biclusters with high co-occurrence values. Although algorithms have been proposed to extract sub-graph biclusters, they provide limited knowledge about the relative importance of individual biclusters and about the importance of the variables within each bicluster. To address these problems, there have been several attempts to employ Bayesian methods or mixture models based on information theory. Although these approaches can rank the biclusters and the variables within a specific bicluster, they do not aim at extracting heavy sub-graph biclusters. Moreover, these models constrain the search so that every cell in the matrix must belong to some bicluster. We attempt to mitigate these constraints by employing dual topic models. In particular, we first propose a generalised Latent Dirichlet Allocation (LDA) topic model that obtains dual topics, i.e., topics in opposite directions: row topics and column topics. To achieve better topics, it applies joint reinforcement, i.e., it considers column topics while creating row topics, and vice versa. Heavy sub-graph biclusters, i.e., associations with high co-occurrence, are then extracted using thresholds. We demonstrate that the proposed Co-clustering for Dual Topic model is useful for obtaining heavy sub-graph biclusters by testing it on simulated data, a text corpus, and microarray gene expression data. The experimental results show that the biclusters extracted by the Co-clustering for Dual Topic model are better than those obtained by traditional biclustering models.
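
The threshold-based extraction step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that row-topic and column-topic weights have already been estimated by some dual topic model, and the function name `extract_biclusters`, the array layout, and the threshold values are all hypothetical choices made for illustration.

```python
import numpy as np


def extract_biclusters(row_topic, col_topic, row_thresh=0.5, col_thresh=0.5):
    """Threshold topic weights into (rows, columns) biclusters.

    Assumed layout (hypothetical, not taken from the paper):
      row_topic[i, k] = weight of row i in topic k
      col_topic[j, k] = weight of column j in topic k
    """
    row_topic = np.asarray(row_topic, dtype=float)
    col_topic = np.asarray(col_topic, dtype=float)
    if row_topic.shape[1] != col_topic.shape[1]:
        raise ValueError("row_topic and col_topic must share the topic axis")

    biclusters = []
    for k in range(row_topic.shape[1]):
        # Keep only rows and columns whose weight for topic k passes the threshold.
        rows = np.flatnonzero(row_topic[:, k] >= row_thresh)
        cols = np.flatnonzero(col_topic[:, k] >= col_thresh)
        # A topic with no surviving rows or columns yields no bicluster.
        if rows.size and cols.size:
            biclusters.append((rows.tolist(), cols.tolist()))
    return biclusters


# Toy weights for 4 rows, 3 columns, and 2 topics (made-up numbers).
row_topic = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.1, 0.9],
                      [0.5, 0.5]])
col_topic = np.array([[0.95, 0.05],
                      [0.10, 0.90],
                      [0.20, 0.80]])

biclusters = extract_biclusters(row_topic, col_topic, row_thresh=0.7, col_thresh=0.7)
print(biclusters)
```

In this toy input the last row has no weight above the threshold for either topic, so it is left out of every bicluster, which mirrors the point that a cell need not be forced into some bicluster.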
