Paper information - Sparse Biclustering of Transposable Data

Sparse Biclustering of Transposable Data

We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log-likelihood. We apply an ℓ1 penalty to the means of the biclusters to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression dataset. This article has supplementary material online.
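The abstract does not spell out the objective or the fitting procedure. One plausible reading is to minimize, over row labels, column labels, and a K x R matrix of bicluster means mu, half the squared error between each entry and its bicluster mean plus lam times the sum of |mu|. The sketch below alternates between soft-thresholded mean updates and nearest-mean reassignment of rows and columns. The function names, the random initialization, the fixed iteration count, and the symbols K, R, and lam are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink a toward zero by lam (the proximal map of the l1 penalty)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def update_means(X, rows, cols, K, R, lam):
    """Penalized mean of each bicluster: soft-threshold the block sum, then
    divide by the block size. Empty blocks keep a mean of zero."""
    mu = np.zeros((K, R))
    for k in range(K):
        for r in range(R):
            block = X[np.ix_(rows == k, cols == r)]
            if block.size:
                mu[k, r] = soft_threshold(block.sum(), lam) / block.size
    return mu


def assign_rows(X, cols, mu):
    """Assign each row to the row cluster with the smallest squared error,
    holding the column labels and the means fixed."""
    M = mu[:, cols]  # K x p: mean seen by every column under each row cluster
    cost = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # n x K
    return cost.argmin(axis=1)


def objective(X, rows, cols, mu, lam):
    """Half the squared error plus the l1 penalty on the bicluster means."""
    resid = X - mu[np.ix_(rows, cols)]
    return 0.5 * (resid ** 2).sum() + lam * np.abs(mu).sum()


def sparse_bicluster(X, K, R, lam, n_iter=20, seed=0):
    """Alternating minimization from a random start (an assumed, simplified
    initialization; a k-means start would be a natural alternative)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(K, size=X.shape[0])
    cols = rng.integers(R, size=X.shape[1])
    for _ in range(n_iter):
        mu = update_means(X, rows, cols, K, R, lam)
        rows = assign_rows(X, cols, mu)
        mu = update_means(X, rows, cols, K, R, lam)
        # Columns are handled by the same routine applied to the transpose.
        cols = assign_rows(X.T, rows, mu.T)
    mu = update_means(X, rows, cols, K, R, lam)
    return rows, cols, mu


# Usage on synthetic data with one planted block of elevated mean.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(30, 20))
X_demo[:15, :10] += 4.0
row_labels, col_labels, means = sparse_bicluster(X_demo, K=2, R=2, lam=5.0)
```

Under this reading, setting lam to zero leaves plain block means, and no step can increase the objective, so the procedure settles into a local optimum that depends on the starting labels. Larger values of lam push more bicluster means exactly to zero, which is the sparsity the abstract refers to.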

Daniela M. Witten | Kean Ming Tan

[1] G. Getz, et al. Coupled two-way clustering analysis of gene microarray data, 2000, Proceedings of the National Academy of Sciences of the United States of America.

[2] Arlindo L. Oliveira, et al. Biclustering algorithms for biological data analysis: a survey, 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3] Genevera I. Allen, et al. Transposable regularized covariance models with an application to missing data imputation, 2009, The Annals of Applied Statistics.

[4] D. Botstein, et al. Cluster analysis and display of genome-wide expression patterns, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[5] Wojtek J. Krzanowski, et al. Improved biclustering of microarray data demonstrated through systematic performance tests, 2005, Comput. Stat. Data Anal.

[6] R. Varga, et al. Proof of Theorem 4, 1983.

[7] Wei Pan, et al. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables, 2008, Electronic Journal of Statistics.

[8] A. Rukhin. Matrix Variate Distributions, 1999, The Multivariate Normal Distribution.

[9] Ji Zhu, et al. Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data, 2008, Biometrics.

[10] J. Franklin, et al. The elements of statistical learning: data mining, inference and prediction, 2005.

[11] Chris H. Q. Ding, et al. Spectral Relaxation for K-means Clustering, 2001, NIPS.

[12] Xiaotong Shen, et al. Penalized model-based clustering with cluster-specific diagonal covariances and grouped variables, 2008.

[13] I. Dhillon, et al. Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering, 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[14] L. Lazzeroni. Plaid models for gene expression data, 2000.

[15] J. Hartigan. Direct Clustering of a Data Matrix, 1972.

[16] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[17] Wei Pan, et al. Penalized Model-Based Clustering with Application to Variable Selection, 2007, J. Mach. Learn. Res.

[18] Jun S. Liu, et al. Bayesian biclustering of gene expression data, 2008, BMC Genomics.

[19] Jianhua Z. Huang, et al. Biclustering via Sparse Singular Value Decomposition, 2010, Biometrics.

[20] Aidong Zhang, et al. Interrelated two-way clustering: an unsupervised approach for gene expression data analysis, 2001, Proceedings 2nd Annual IEEE International Symposium on Bioinformatics and Bioengineering (BIBE 2001).

[21] Krista Rizman Zalik, et al. Biclustering of gene expression data, 2005.

[22] Inderjit S. Dhillon, et al. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data, 2004, SDM.

[23] Mahmoud Mounir, et al. On biclustering of gene expression data, 2015, 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS).

[24] R. Tibshirani, et al. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, 2009, Biostatistics.

[25] A. Nobel, et al. Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data, 2008.

[26] R. Tibshirani, et al. Covariance-regularized regression and classification for high dimensional problems, 2009, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[27] A. Nobel, et al. Finding large average submatrices in high dimensional data, 2009, arXiv:0905.1682.

[28] Adrian E. Raftery, et al. Model-Based Clustering, Discriminant Analysis, and Density Estimation, 2002.

[29] R. Tibshirani, et al. Sparse inverse covariance estimation with the graphical lasso, 2008, Biostatistics.

[30] William M. Rand, et al. Objective Criteria for the Evaluation of Clustering Methods, 1971.

[31] Robert Tibshirani, et al. A Framework for Feature Selection in Clustering, 2010, Journal of the American Statistical Association.

[32] Robert Tibshirani, et al. Hybrid hierarchical clustering with applications to microarray data, 2005, Biostatistics.

[33] Lothar Thiele, et al. A systematic comparison and evaluation of biclustering methods for gene expression data, 2006.

[34] Ulrich Bodenhofer, et al. FABIA: factor analysis for bicluster acquisition, 2010, Bioinform.