Binned multinomial logistic regression for integrative cell type annotation

Categorizing individual cells into one of many known cell type categories, also known as cell type annotation, is a critical step in the analysis of single-cell genomics data. The current process of annotation is time-intensive and subjective, which has led different studies to describe cell types with labels of varying resolution. While supervised learning approaches have provided automated solutions to annotation, fitting a unified model to multiple datasets with inconsistent labels remains a significant challenge. In this article, we propose a new multinomial logistic regression estimator that models cell type probabilities by integrating multiple datasets with labels of varying resolution. To compute our estimator, we solve a nonconvex optimization problem using a blockwise proximal gradient descent algorithm. We show through simulation studies that our approach estimates cell type probabilities more accurately than competitors in a wide variety of scenarios. We apply our method to ten single-cell RNA-seq datasets and demonstrate its utility in predicting fine-resolution cell type labels on unlabeled data as well as refining cell type labels on data with existing coarse-resolution annotations. An R package implementing the method is available at https://github.com/keshav-motwani/IBMR and the collection of datasets we analyze is available at https://github.com/keshav-motwani/AnnotatedPBMC.
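To make the idea of labels of varying resolution concrete, the following is a minimal sketch of a binned multinomial negative log-likelihood. It rests on an assumption that the abstract does not spell out: a coarse label is treated as a "bin" of fine-resolution categories, and its probability is the sum of the fine-category probabilities in that bin. The function names, the `bins` representation, and the toy data are hypothetical illustrations in Python, not the interface of the IBMR R package, and the sketch omits the penalty and the blockwise proximal gradient descent algorithm.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=1, keepdims=True)


def binned_neg_log_lik(X, B, bins):
    """Average binned multinomial negative log-likelihood (illustrative).

    X    : (n, p) array of predictors, one row per cell.
    B    : (p, K) array of coefficients, one column per fine category.
    bins : list of length n; bins[i] lists the fine-category indices
           consistent with cell i's (possibly coarse) label.
           This encoding is an assumption made for the sketch.
    """
    probs = softmax(X @ B)
    total = 0.0
    for i, allowed in enumerate(bins):
        # Probability of a coarse label = sum over the fine categories it contains.
        total -= np.log(probs[i, allowed].sum())
    return total / len(bins)


# Toy data: 4 cells, 3 predictors, K = 4 fine categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
B = np.zeros((3, 4))  # zero coefficients give uniform probabilities of 1/4

# Two cells carry fine labels; two carry coarse labels spanning two categories.
bins = [[0], [1], [0, 1], [2, 3]]
loss = binned_neg_log_lik(X, B, bins)
```

Under this formulation, a cell whose label spans every fine category contributes nothing to the loss, whereas a finely labeled cell contributes the usual multinomial term. This is one way a single model can pool datasets annotated at different resolutions.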
