Velo-Predictor: an ensemble learning pipeline for RNA velocity prediction

Background: RNA velocity is a powerful concept that enables the inference of dynamic cell state changes from seemingly static single-cell RNA sequencing (scRNA-seq) data. However, accurate estimation of RNA velocity remains challenging, and the kinetic mechanisms underlying transcriptional and splicing regulation are not fully understood. Moreover, scRNA-seq data tend to be sparse relative to the space of possible cell states, so a given dataset of estimated RNA velocities requires imputation for cell states that are not yet covered.

Results: We formulate RNA velocity prediction as a supervised classification problem for the first time: the cell state space is divided by direction into equal-sized segments, which serve as the classes, and the estimated RNA velocity vectors are treated as ground truth. We propose Velo-Predictor, an ensemble learning pipeline for predicting RNA velocities from scRNA-seq data. We test different models on two real datasets. Velo-Predictor performs well, especially when XGBoost is used as the base predictor. Parameter analysis and visualization also show that the method is robust and able to make biologically meaningful predictions.

Conclusion: These results show that Velo-Predictor can effectively simplify the procedure by learning a predictive model from gene expression data, which could help to construct a continuous landscape and give biologists an intuitive picture of the trend of cellular dynamics.
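The classification framing in the Results can be illustrated with a minimal sketch. It assumes that velocity vectors have already been projected into a 2D embedding and that the direction space is split into 8 equal angular sectors. Both choices are assumptions made for illustration, since the abstract does not state the dimensionality or the number of classes. The function name `direction_class` is hypothetical and not taken from the paper.

```python
import math


def direction_class(vx, vy, n_classes=8):
    """Map a 2D velocity vector to one of n_classes equal-width angular sectors.

    Assumption: classes are sectors of the unit circle, numbered
    counter-clockwise starting at the positive x-axis.
    """
    if vx == 0 and vy == 0:
        raise ValueError("a zero vector has no direction")
    # atan2 returns an angle in (-pi, pi]; shift it into [0, 2*pi).
    angle = math.atan2(vy, vx) % (2 * math.pi)
    width = 2 * math.pi / n_classes
    # The final modulo guards against rounding that lands exactly on 2*pi.
    return int(angle // width) % n_classes


# Hypothetical estimated velocities for four cells (illustrative values only).
velocities = [(1.0, 0.1), (0.1, 1.0), (-1.0, -0.1), (0.1, -1.0)]
labels = [direction_class(vx, vy) for vx, vy in velocities]
```

Labels produced this way could serve as training targets for any multi-class classifier applied to the gene expression features of each cell. The choice of classifier and features is left open here.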

[1]  Tao Peng,et al.  scEpath: energy landscape-based inference of transition probabilities and cellular trajectories from single-cell transcriptomic data , 2018, Bioinform..

[2]  W. Maas,et al.  The potential for the formation of a biosynthetic enzyme in Escherichia coli. , 1957, Biochimica et biophysica acta.

[3]  Fabian J Theis,et al.  Generalizing RNA velocity to transient cell states through dynamical modeling , 2019, Nature Biotechnology.

[4]  Jing Guo,et al.  HopLand: single-cell pseudotime recovery using continuous Hopfield network-based modeling of Waddington’s epigenetic landscape , 2017, Bioinform..

[5]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[6]  Cole Trapnell,et al.  The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells , 2014, Nature Biotechnology.

[7]  Patrick Lucey,et al.  Where Will They Go? Predicting Fine-Grained Adversarial Multi-agent Motion Using Conditional Variational Autoencoders , 2018, ECCV.

[8]  Yvan Saeys,et al.  A comparison of single-cell trajectory inference methods , 2019, Nature Biotechnology.

[9]  A. Bhardwaj,et al.  In situ click chemistry generation of cyclooxygenase-2 inhibitors , 2017, Nature Communications.

[10]  Casper Kaae Sønderby,et al.  scVAE: Variational auto-encoders for single-cell gene expression data , 2018, bioRxiv.

[11]  S. Teichmann,et al.  Exponential scaling of single-cell RNA-seq in the past decade , 2017, Nature Protocols.

[12]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Erik Sundström,et al.  RNA velocity of single cells , 2018, Nature.

[14]  A. Teschendorff,et al.  Single-cell entropy for accurate estimation of differentiation potency from a cell's transcriptome , 2017, Nature Communications.

[15]  Grace X. Y. Zheng,et al.  Massively parallel digital transcriptional profiling of single cells , 2016, bioRxiv.

[16]  Fabian J. Theis,et al.  Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis , 2019, Development.

[17]  Michael I. Jordan,et al.  Deep Generative Modeling for Single-cell Transcriptomics , 2018, Nature Methods.

[18]  E. Marco,et al.  Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape , 2014, Proceedings of the National Academy of Sciences.

[19]  Hung T. Nguyen,et al.  Fast unsupervised learning method for rapid estimation of cluster centroids , 2012, 2012 IEEE Congress on Evolutionary Computation.

[20]  Grace X. Y. Zheng,et al.  Massively parallel digital transcriptional profiling of single cells , 2016, Nature Communications.

[21]  Eytan Domany,et al.  Coupled pre-mRNA and mRNA dynamics unveil operational strategies underlying transcriptional responses to stimuli , 2013 .

[22]  Caleb Weinreb,et al.  Fundamental limits on dynamic inference from single-cell snapshots , 2017, Proceedings of the National Academy of Sciences.

[23]  S. Linnarsson,et al.  Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing , 2018, Nature Neuroscience.

[24]  Fabian J Theis,et al.  Single-cell RNA-seq denoising using a deep count autoencoder , 2019, Nature Communications.

[25]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[26]  Fabian J Theis,et al.  Generalizing RNA velocity to transient cell states through dynamical modeling , 2019, bioRxiv.

[27]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[28]  Sebastian Raschka,et al.  MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack , 2018, J. Open Source Softw..

[29]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[30]  I. Tomek An Experiment with the Edited Nearest-Neighbor Rule , 1976 .

[31]  Ana L. C. Bazzan,et al.  Balancing Training Data for Automated Annotation of Keywords: a Case Study , 2003, WOB.

[32]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[33]  Hien M. Nguyen,et al.  Borderline over-sampling for imbalanced data classification , 2009, Int. J. Knowl. Eng. Soft Data Paradigms.

[34]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[35]  Neil D. Lawrence,et al.  Topslam: Waddington Landscape Recovery for Single Cell Experiments , 2016, bioRxiv.

[36]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..