Sparse Biclustering of Transposable Data

We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log-likelihood. We apply an ℓ1 penalty to the means of the biclusters to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression dataset. This article has supplementary material online.

[1]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3]  Genevera I. Allen,et al.  TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION. , 2009, The annals of applied statistics.

[4]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Wojtek J. Krzanowski,et al.  Improved biclustering of microarray data demonstrated through systematic performance tests , 2005, Comput. Stat. Data Anal..

[6]  R. Varga,et al.  Proof of Theorem 4 , 1983 .

[7]  Wei Pan,et al.  Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. , 2008, Electronic journal of statistics.

[8]  A. Rukhin Matrix Variate Distributions , 1999, The Multivariate Normal Distribution.

[9]  Ji Zhu,et al.  Variable Selection for Model‐Based High‐Dimensional Clustering and Its Application to Microarray Data , 2008, Biometrics.

[10]  J. Franklin,et al.  The elements of statistical learning: data mining, inference and prediction , 2005 .

[11]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[12]  Xiaotong Shen,et al.  Penalized model-based clustering with cluster-specic diagonal covariances and grouped variables , 2008 .

[13]  I. Dhillon,et al.  Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[14]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[15]  J. Hartigan Direct Clustering of a Data Matrix , 1972 .

[16]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[17]  Wei Pan,et al.  Penalized Model-Based Clustering with Application to Variable Selection , 2007, J. Mach. Learn. Res..

[18]  Jun S Liu,et al.  Bayesian biclustering of gene expression data , 2008, BMC Genomics.

[19]  Jianhua Z. Huang,et al.  Biclustering via Sparse Singular Value Decomposition , 2010, Biometrics.

[20]  Aidong Zhang,et al.  Interrelated two-way clustering: an unsupervised approach for gene expression data analysis , 2001, Proceedings 2nd Annual IEEE International Symposium on Bioinformatics and Bioengineering (BIBE 2001).

[21]  Krista Rizman Zalik,et al.  Biclustering of gene expression data , 2005 .

[22]  Inderjit S. Dhillon,et al.  Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data , 2004, SDM.

[23]  Mahmoud Mounir,et al.  On biclustering of gene expression data , 2015, 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS).

[24]  R. Tibshirani,et al.  A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. , 2009, Biostatistics.

[25]  A. Nobel,et al.  Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data , 2008 .

[26]  R. Tibshirani,et al.  Covariance‐regularized regression and classification for high dimensional problems , 2009, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[27]  A. Nobel,et al.  Finding large average submatrices in high dimensional data , 2009, 0905.1682.

[28]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .

[29]  R. Tibshirani,et al.  Sparse inverse covariance estimation with the graphical lasso. , 2008, Biostatistics.

[30]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[31]  Robert Tibshirani,et al.  A Framework for Feature Selection in Clustering , 2010, Journal of the American Statistical Association.

[32]  Robert Tibshirani,et al.  Hybrid hierarchical clustering with applications to microarray data. , 2005, Biostatistics.

[33]  ThieleLothar,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006 .

[34]  Ulrich Bodenhofer,et al.  FABIA: factor analysis for bicluster acquisition , 2010, Bioinform..