CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

Background: Bayesian factorization methods, including Coordinated Gene Activity in Pattern Sets (CoGAPS), are emerging as powerful analysis tools for single-cell data. However, these methods have greater computational costs than their gradient-based counterparts, and these costs are often prohibitive for the analysis of large single-cell datasets. Many such methods can be run in parallel, which allows this limitation to be overcome by running on more powerful hardware. However, the constraints imposed by the prior distributions in CoGAPS limit the applicability of parallelization methods to enhance computational efficiency for single-cell analysis.

Results: We developed a new software framework for parallel matrix factorization in Version 3 of the CoGAPS R/Bioconductor package to overcome the computational limitations of Bayesian matrix factorization for single-cell data analysis. This parallelization framework provides asynchronous updates for sequential updating steps of the algorithm to enhance computational efficiency. These algorithmic advances were coupled with a new software architecture and sparse data structures to reduce the memory overhead for single-cell data.

Conclusions: Altogether, our new software enhances the efficiency of the CoGAPS Bayesian matrix factorization algorithm so that it can analyze 1000 times more cells, enabling factorization of large single-cell datasets.
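To illustrate why sparse data structures reduce memory overhead for single-cell data, the following is a minimal sketch of a compressed sparse row (CSR) layout in pure Python. It is not the CoGAPS implementation: the function names `to_csr` and `csr_row` and the toy count matrix are hypothetical, chosen only to show that storing nonzero entries alone can take far fewer slots than a dense matrix when most counts are zero.

```python
# Illustrative sketch only: a compressed sparse row (CSR) layout in pure Python.
# This is NOT the CoGAPS implementation; all names here are hypothetical.


def to_csr(dense):
    """Convert a list-of-rows matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # store only the nonzero entry
                col_idx.append(j)  # remember which column it came from
        row_ptr.append(len(values))  # mark where the next row starts
    return values, col_idx, row_ptr


def csr_row(values, col_idx, row_ptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_idx[k]] = values[k]
    return row


# Toy genes-by-cells count matrix in which most entries are zero.
counts = [
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
    [0, 0, 0, 4, 0, 0],
]

values, col_idx, row_ptr = to_csr(counts)
dense_slots = len(counts) * len(counts[0])
sparse_slots = len(values) + len(col_idx) + len(row_ptr)
print(dense_slots, sparse_slots)  # prints "24 13"
```

In this toy case the dense layout needs 24 slots while the CSR arrays need 13. The saving grows with the fraction of zeros, which is typically high in single-cell count matrices.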
