Ensemble learning models that predict surface protein abundance from single-cell multimodal omics data.

Single-cell protein abundance is fundamental information for characterizing cell states. Because of high cost and technical barriers, however, direct quantification of proteins is difficult. Single-cell RNA sequencing (scRNA-seq) data, often used as a cost-effective substitute for single-cell proteomics, may not accurately reflect protein expression levels because of measurement error, noise, and post-transcriptional and translational regulation. Recently emerging single-cell multimodal omics technologies, e.g. CITE-seq and REAP-seq, simultaneously profile RNA and protein abundances in single cells, providing labeled data for predictive modeling in a supervised learning framework. A deep neural network-based transfer learning method has been applied to imputing surface protein abundance from single-cell transcriptomic data. However, it is unclear whether an artificial neural network is the best model, and it is desirable to improve the performance of machine learning models in terms of both prediction accuracy and interpretability. In this paper, we compared several tree-based ensemble learning methods with neural network models and found that ensemble learning often performed better than neural networks, with Random Forest (RF) performing best overall. Moreover, we used the feature importance scores from RF to interpret the biological mechanisms underlying the predictions. Our study demonstrates the effectiveness of ensemble learning for reliable protein abundance prediction from single-cell multimodal omics data and paves the way for knowledge discovery by mining single-cell multi-omics data at large scale.
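
The supervised setup described above can be illustrated with a minimal sketch. It is not the pipeline used in the paper: the data are synthetic, and the dimensions, the simulated relationship between genes and protein, the hyperparameters, and the use of Pearson correlation as the score are all assumptions made for illustration. It fits a Random Forest and a small multilayer perceptron with scikit-learn on the same RNA matrix, scores both on held-out cells, and ranks genes by RF feature importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for paired RNA/protein measurements (assumption:
# real CITE-seq or REAP-seq data would be loaded and normalized here).
rng = np.random.default_rng(0)
n_cells, n_genes = 600, 50
rna = np.log1p(rng.poisson(lam=2.0, size=(n_cells, n_genes)).astype(float))

# Hypothetical ground truth: one surface protein driven by genes 0 and 1,
# with a nonlinear term and additive measurement noise.
protein = 2.0 * rna[:, 0] + rna[:, 1] ** 2 + rng.normal(scale=0.3, size=n_cells)

X_train, X_test, y_train, y_test = train_test_split(
    rna, protein, test_size=0.25, random_state=0
)

# Tree-based ensemble and neural network trained on the same cells.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)


def pearson(a, b):
    """Pearson correlation between observed and predicted abundance."""
    return float(np.corrcoef(a, b)[0, 1])


rf_r = pearson(y_test, rf.predict(X_test))
mlp_r = pearson(y_test, mlp.predict(X_test))

# Impurity-based importances sum to 1; larger values mark genes that the
# forest relied on more when predicting the protein.
top_genes = np.argsort(rf.feature_importances_)[::-1][:5]

print(f"RF  Pearson r on held-out cells: {rf_r:.3f}")
print(f"MLP Pearson r on held-out cells: {mlp_r:.3f}")
print("Top genes by RF importance:", top_genes.tolist())
```

Which model scores higher in a toy setting like this says nothing about real data; the sketch only shows how the two model families can be compared on equal footing and how importance scores point back to candidate genes.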