Comparison of sparse biclustering algorithms for gene expression datasets

Gene clustering and sample clustering are commonly used to find patterns in gene expression datasets. However, in heterogeneous samples (e.g. different tissues or disease states), genes may cluster differently across groups of samples. Biclustering algorithms aim to address this by clustering samples and genes simultaneously. Existing reviews of biclustering algorithms omit several more recent algorithms, and they base their comparisons on simplistic simulated datasets and less robust metrics, without specifically evaluating the biclusters found in real datasets. In this study we compared four classes of sparse biclustering algorithms on a range of simulated and real datasets. In particular, we used a knockout mouse RNA-seq dataset to evaluate each algorithm's ability to simultaneously cluster genes and samples across multiple tissues. We found that Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than the other algorithm classes. We assessed whether non-negative matrix factorisation algorithms can be repurposed for biclustering and found that, although their raw output was poor, one such algorithm was among the most highly ranked on real datasets after applying a sparsity-inducing post-processing procedure that we introduce. We also demonstrate the limitations of biclustering algorithms by varying the complexity of the simulated datasets. The algorithms generally struggled on simulated datasets with a large number of implanted factors or a large number of genes. In real datasets, the algorithms rarely returned clusters containing samples from multiple tissues, which highlights the need for further thought in the design and analysis of multi-tissue studies, so that differences between tissues do not dominate the analysis.
Code to run the analysis is available at https://github.com/nichollskc/biclust_comp, including wrappers for each algorithm, implementations of evaluation metrics, and code to simulate datasets and perform pre- and post-processing. The full tables of results are available at https://doi.org/10.5281/zenodo.4317556
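
To make the idea of repurposing non-negative matrix factorisation for biclustering concrete, the following is a minimal sketch, not the procedure used in the study. It assumes plain multiplicative-update NMF, a hypothetical relative threshold (`rel_threshold`) as the sparsity-inducing step, and a cell-level Jaccard index as the recovery score; the function names and the toy dataset are likewise illustrative assumptions, and the repository linked above holds the actual wrappers, metrics and post-processing.

```python
import numpy as np


def nmf_multiplicative(Y, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate non-negative Y (samples x genes) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = Y.shape
    W = rng.random((n_samples, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Standard multiplicative updates; eps guards against division by zero.
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
        W *= (Y @ H.T) / (W @ H @ H.T + eps)
    return W, H


def sparsify(M, rel_threshold, axis):
    """Zero every entry not above rel_threshold times its factor's maximum.

    Assumed post-processing rule, chosen only for illustration.
    """
    peak = M.max(axis=axis, keepdims=True)
    return np.where(M > rel_threshold * peak, M, 0.0)


def extract_biclusters(W, H, rel_threshold=0.3):
    """Read one (samples, genes) bicluster off each factor after thresholding."""
    W_sparse = sparsify(W, rel_threshold, axis=0)  # factors are columns of W
    H_sparse = sparsify(H, rel_threshold, axis=1)  # factors are rows of H
    biclusters = []
    for j in range(W.shape[1]):
        samples = set(np.flatnonzero(W_sparse[:, j]).tolist())
        genes = set(np.flatnonzero(H_sparse[j, :]).tolist())
        if samples and genes:
            biclusters.append((samples, genes))
    return biclusters


def bicluster_jaccard(a, b):
    """Jaccard index of two biclusters, counted over matrix cells."""
    (rows_a, cols_a), (rows_b, cols_b) = a, b
    shared = len(rows_a & rows_b) * len(cols_a & cols_b)
    total = len(rows_a) * len(cols_a) + len(rows_b) * len(cols_b) - shared
    return shared / total if total else 0.0


# Toy dataset: two implanted, non-overlapping biclusters on weak noise.
rng = np.random.default_rng(1)
Y = 0.1 * rng.random((20, 30))
Y[0:8, 0:10] += 5.0
Y[10:18, 15:25] += 3.0
truth = [
    (set(range(0, 8)), set(range(0, 10))),
    (set(range(10, 18)), set(range(15, 25))),
]

W, H = nmf_multiplicative(Y, k=2)
found = extract_biclusters(W, H)
for true_bicluster in truth:
    best = max((bicluster_jaccard(true_bicluster, f) for f in found), default=0.0)
    print(f"best Jaccard match: {best:.2f}")
```

Without the thresholding step, every sample and gene carries a non-zero loading on every factor, so the raw NMF output does not define biclusters at all; the threshold is what turns dense factors into sparse membership sets.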
