Deep surveys of transcriptional modules with Massive Associative K-biclustering (MAK)

Biclustering can reveal functional patterns in common biological data such as gene expression. Biclusters are ordered submatrices of a larger matrix that represent coherent data patterns. A critical requirement for biclusters is high coherence across a subset of columns, where coherence is defined as the fit to a mathematical model of similarity or correlation. Biclustering, though powerful, is NP-hard, and existing biclustering methods implement a wide variety of approximations to achieve tractable solutions for real-world datasets. High bicluster coherence becomes more computationally expensive to achieve with high-dimensional data, both because the search space grows and because the number, size, and overlap of biclusters tend to increase. This complicates an already difficult problem and leads existing methods to find smaller, less coherent biclusters. Our unsupervised Massive Associative K-biclustering (MAK) approach corrects this size bias while preserving high bicluster coherence, both on simulated datasets with known ground truth and on real-world data without ground truth, where we apply a new measure to evaluate biclustering. Moreover, MAK jointly maximizes bicluster coherence with biological enrichment and finds the most enriched biological functions. Another long-standing problem with these methods is the overwhelming data signal related to ribosomal functions and protein production, which can drown out signals for less common and therefore more interesting functions. MAK reports the second-most enriched non-protein production functions, with higher bicluster coherence and arrayed across a large number of biclusters, demonstrating its ability to alleviate this biological bias and thus reflect the mediation of multiple biological processes rather than the recruitment of processes to a small number of major cell activities. Finally, compared to the union of results from 11 top biclustering methods, MAK finds 21 novel S. cerevisiae biclusters.
MAK can generate high-quality biclusters in large biological datasets, including simultaneous integration of up to four distinct biological data types.

Author summary

Biclustering can reveal functional patterns in common biological data such as gene expression. A critical requirement for biclusters is high coherence across a subset of columns, where coherence is defined as the fit to a mathematical model of similarity or correlation. Biclustering, though powerful, is NP-hard, and existing biclustering methods implement a wide variety of approximations to achieve tractable solutions for real-world datasets. High-dimensional data complicate an already difficult problem and lead existing biclustering methods to find smaller and less coherent biclusters. Using the MAK methodology, we can correct the bicluster size bias while preserving high bicluster coherence on simulated datasets with known ground truth as well as on real-world datasets, where we apply a new data-driven bicluster set score. MAK jointly maximizes bicluster coherence with biological enrichment and finds more enriched biological functions, including functions other than protein production. These functions are arrayed across a large number of MAK biclusters, demonstrating MAK's ability to alleviate the biological bias toward protein production and to reflect the mediation of multiple biological processes rather than the recruitment of processes to a small number of major cell activities. MAK can generate high-quality biclusters in large biological datasets, including simultaneous integration of up to four distinct biological data types.
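
The idea of coherence as fit to a mathematical model can be illustrated with a minimal sketch. The example below uses the mean squared residue of Cheng and Church, a widely used additive-model criterion. This choice is an assumption made for illustration only and is not MAK's own scoring, which is not specified here. The function name and the toy matrix are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows x cols.

    A score of 0 means every value equals row effect + column effect
    (a perfectly coherent additive bicluster); larger values mean a worse fit.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: rows 0-2 x columns 0-2 follow an additive pattern,
# while the last row and column break it.
toy = [
    [1.0, 2.0, 4.0, 9.0],
    [2.0, 3.0, 5.0, 0.0],
    [5.0, 6.0, 8.0, 3.0],
    [7.0, 1.0, 0.0, 4.0],
]
planted = mean_squared_residue(toy, [0, 1, 2], [0, 1, 2])
whole = mean_squared_residue(toy, [0, 1, 2, 3], [0, 1, 2, 3])
print(planted, whole)
```

The planted submatrix scores essentially zero while the full matrix scores much higher, which is the sense in which a bicluster is more coherent than its surrounding data. A criterion of this kind also rewards very small submatrices, which is one way a bias toward small biclusters can arise.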
