Spike-and-slab Lasso biclustering

Biclustering methods simultaneously group samples and their associated features. In this way, they differ from traditional clustering methods, which use the entire set of features to distinguish groups of samples. Motivating applications for biclustering include genomics, where the goal is to cluster patients or samples by their gene expression profiles, and recommender systems, which seek to group customers based on their product preferences. Biclusters of interest often manifest as rank-1 submatrices of the data matrix. This submatrix detection problem can be viewed as a factor analysis problem in which both the factors and loadings are sparse. In this paper, we propose a new biclustering method called Spike-and-Slab Lasso Biclustering (SSLB), which uses the Spike-and-Slab Lasso of Ročková and George (2018) to find such a sparse factorization of the data matrix. SSLB also incorporates an Indian Buffet Process prior to automatically choose the number of biclusters. Many biclustering methods make assumptions about the size of the latent biclusters, assuming either that the biclusters are all of the same size or that they are all very large or very small. In contrast, SSLB can adapt to find biclusters across a continuum of sizes. SSLB is implemented via a fast EM algorithm with a variational step. In a variety of simulation settings, SSLB outperforms other biclustering methods. We apply SSLB to both a microarray dataset and a single-cell RNA-sequencing dataset and highlight that SSLB can recover biologically meaningful structures in the data. The SSLB software is available as an R/C++ package at https://github.com/gemoran/SSLB.
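To make the "sparse factors and sparse loadings" view concrete, the following is a minimal illustrative sketch in plain Python, not the SSLB algorithm or its R/C++ API. It builds a data matrix as a sum of rank-1 blocks, each nonzero only on its own samples and features, and then reads the biclusters back off the nonzero supports of the factor and loading matrices. The function names (`simulate_biclusters`, `read_off_biclusters`) and the 0/1 factor coding are assumptions made for illustration only; SSLB itself estimates these sparse matrices from data rather than taking them as given.

```python
import random


def simulate_biclusters(n_samples, n_features, biclusters, noise_sd=0.0, seed=0):
    """Build Y = X B^T + noise with sparse factors X and sparse loadings B.

    `biclusters` is a list of (sample_indices, feature_indices, effect)
    tuples (a hypothetical input format). Bicluster k contributes a rank-1
    block that is nonzero only on its own samples and features.
    """
    rng = random.Random(seed)
    n_factors = len(biclusters)
    X = [[0.0] * n_factors for _ in range(n_samples)]   # factors: samples x K
    B = [[0.0] * n_factors for _ in range(n_features)]  # loadings: features x K
    for k, (rows, cols, effect) in enumerate(biclusters):
        for i in rows:
            X[i][k] = 1.0
        for j in cols:
            B[j][k] = effect
    # Each entry of Y is the sum of the rank-1 contributions plus Gaussian noise.
    Y = [
        [
            sum(X[i][k] * B[j][k] for k in range(n_factors))
            + rng.gauss(0.0, noise_sd)
            for j in range(n_features)
        ]
        for i in range(n_samples)
    ]
    return Y, X, B


def read_off_biclusters(X, B, tol=1e-12):
    """Return (sample_indices, feature_indices) for each factor's nonzero support."""
    n_factors = len(X[0])
    result = []
    for k in range(n_factors):
        rows = [i for i, x_row in enumerate(X) if abs(x_row[k]) > tol]
        cols = [j for j, b_row in enumerate(B) if abs(b_row[k]) > tol]
        result.append((rows, cols))
    return result
```

In this sketch a sample may belong to more than one bicluster (its row of `X` then has several nonzero entries), and the biclusters may differ in size, which is the kind of structure the abstract describes SSLB as adapting to.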

[1]  Arlindo L. Oliveira, et al. Biclustering algorithms for biological data analysis: a survey, 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[2]  Yudong D. He, et al. Gene expression profiling predicts clinical outcome of breast cancer, 2002, Nature.

[3]  René Peeters, et al. The maximum edge biclique problem is NP-complete, 2003, Discret. Appl. Math.

[4]  Paul Polakis, et al. Wnt signaling in cancer, 2012, Cold Spring Harbor perspectives in biology.

[5]  Aaditya V. Rangan, et al. A loop-counting method for covariate-corrected low-rank biclustering of gene-expression and genome-wide association study data, 2018, PLoS Comput. Biol.

[6]  Ümit V. Çatalyürek, et al. Comparative analysis of biclustering algorithms, 2010, BCB '10.

[7]  Z. Estrov, et al. Leukemia-inhibitory factor stimulates breast, kidney and prostate cancer cell proliferation by paracrine and autocrine pathways, 1996, International journal of cancer.

[8]  E. George, et al. Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity, 2016.

[9]  William Feller, et al. An Introduction to Probability Theory and Its Applications, 1950.

[10]  David B. Dunson, et al. Generalized Beta Mixtures of Gaussians, 2011, NIPS.

[11]  Anindya Bhattacharya, et al. A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules, 2017, Scientific Reports.

[12]  Sameer K. Deshpande, et al. Simultaneous Variable and Covariance Selection With the Multivariate Spike-and-Slab LASSO, 2017, Journal of Computational and Graphical Statistics.

[13]  Xiaotong Shen, et al. Personalized Prediction and Sparsity Pursuit in Latent Factor Models, 2016.

[14]  Peter A. Calabresi, et al. Spike-and-Slab Group LASSOs for Grouped Regression and Sparse Generalized Additive Models, 2019.

[15]  V. Rocková, et al. Bayesian estimation of sparse signals with a continuous spike-and-slab prior, 2018.

[16]  Panos M. Pardalos, et al. Recent Advances of Data Biclustering with Application in Computational Neuroscience, 2010.

[17]  Yee Whye Teh, et al. Stick-breaking Construction for the Indian Buffet Process, 2007, AISTATS.

[18]  James Clerk Maxwell, et al. V. Illustrations of the dynamical theory of gases. Part I. On the motions and collisions of perfectly elastic spheres, 1860.

[19]  H. Kaiser. The varimax criterion for analytic rotation in factor analysis, 1958.

[20]  J. Hartigan. Direct Clustering of a Data Matrix, 1972.

[21]  A. V. D. Vaart, et al. Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences, 2012, arXiv:1211.1197.

[22]  Ulrich Bodenhofer, et al. FABIA: factor analysis for bicluster acquisition, 2010, Bioinform.

[23]  Mehmet Deveci, et al. A comparative analysis of biclustering algorithms for gene expression data, 2013, Briefings Bioinform.

[24]  A. V. D. Vaart, et al. Bayesian Linear Regression with Sparse Priors, 2014, arXiv:1403.0735.

[25]  Ümit V. Çatalyürek, et al. A Biclustering Method to Discover Co-regulated Genes Using Diverse Gene Expression Datasets, 2009, BICoB.

[26]  Thomas L. Griffiths, et al. The Indian Buffet Process: An Introduction and Review, 2011, J. Mach. Learn. Res.

[27]  S. Linnarsson, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq, 2015, Science.

[28]  A. Nobel, et al. Finding large average submatrices in high dimensional data, 2009, arXiv:0905.1682.

[29]  Terence P. Speed, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, 2003, Bioinform.

[30]  Nancy R. Zhang, et al. SAVER: Gene expression recovery for single-cell RNA sequencing, 2018, Nature Methods.

[31]  Richard M. Karp, et al. Discovering local structure in gene expression data: the order-preserving submatrix problem, 2002, RECOMB '02.

[32]  Gavin D. Grant, et al. Common markers of proliferation, 2006, Nature Reviews Cancer.

[33]  A. Onitilo, et al. Breast Cancer Subtypes Based on ER/PR and Her2 Expression: Comparison of Clinicopathologic Features and Survival, 2009, Clinical Medicine & Research.

[34]  J. Friedman, et al. Clustering objects on subsets of attributes (with discussion), 2004.

[35]  J. Munkres. Algorithms for the Assignment and Transportation Problems, 1957.

[36]  Ricardo J. G. B. Campello, et al. A systematic comparative evaluation of biclustering techniques, 2017, BMC Bioinformatics.

[37]  Sven Bergmann, et al. Modular analysis of gene expression data with R, 2010, Bioinform.

[38]  Joseph T. Chang, et al. Spectral biclustering of microarray data: coclustering genes and conditions, 2003, Genome research.

[39]  Lothar Thiele, et al. A systematic comparison and evaluation of biclustering methods for gene expression data, 2006, Bioinform.

[40]  Chuan Gao, et al. Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering, 2016, PLoS Comput. Biol.

[41]  George M. Church, et al. Biclustering of Expression Data, 2000, ISMB.

[42]  Kathleen A Cronin, et al. US incidence of breast cancer subtypes defined by joint hormone receptor and HER2 status, 2014, Journal of the National Cancer Institute.

[43]  Panos M. Pardalos, et al. Data Mining in Agriculture, 2008.

[44]  Fabrício Olivetti de França, et al. Evaluating the Performance of a Biclustering Algorithm Applied to Collaborative Filtering - A Comparative Analysis, 2007, 7th International Conference on Hybrid Intelligent Systems (HIS 2007).

[45]  Gioele La Manno, et al. Quantitative single-cell RNA-seq with unique molecular identifiers, 2013, Nature Methods.

[46]  Karl Rohe, et al. Vintage Factor Analysis with Varimax Performs Statistical Inference, 2020.

[47]  Thomas L. Griffiths, et al. Infinite latent feature models and the Indian buffet process, 2005, NIPS.

[48]  Jun S Liu, et al. Bayesian biclustering of gene expression data, 2008, BMC Genomics.

[49]  Gemma E. Moran, et al. Variance Prior Forms for High-Dimensional Bayesian Variable Selection, 2018, Bayesian Analysis.

[50]  Hedibert Freitas Lopes, et al. Parsimonious Bayesian Factor Analysis when the Number of Factors is Unknown, 2010.

[51]  L. Lazzeroni. Plaid models for gene expression data, 2000.

[52]  Mário A. T. Figueiredo, et al. Spike and slab biclustering, 2017, Pattern Recognit.

[53]  I. Weissman, et al. Stem cells, cancer, and cancer stem cells, 2001, Nature.

[54]  Yee Whye Teh, et al. Variational Inference for the Indian Buffet Process, 2009, AISTATS.

[55]  Sven Bergmann, et al. Iterative signature algorithm for the analysis of large-scale gene expression data, 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[56]  James G. Scott, et al. The horseshoe estimator for sparse signals, 2010.

[57]  Guangchuang Yu, et al. clusterProfiler: an R package for comparing biological themes among gene clusters, 2012, Omics: a journal of integrative biology.

[58]  E. George, et al. The Spike-and-Slab LASSO, 2018.