Integration of millions of transcriptomes using batch-aware triplet neural networks

Efficient integration of heterogeneous and increasingly large single-cell RNA sequencing datasets poses a major challenge for analysis and, in particular, for comprehensive atlasing efforts. Here we develop a deep learning algorithm called INSCT (Insight) that overcomes batch effects using batch-aware triplet neural networks. Using simulated and real data, we demonstrate that INSCT generates an embedding space that accurately integrates cells across experiments, platforms and species. Benchmark comparisons with current state-of-the-art single-cell RNA sequencing integration methods show that INSCT outperforms competing methods in scalability while achieving comparable accuracy. Moreover, in semisupervised mode INSCT enables users to classify unlabelled cells by projecting them into a reference collection of annotated cells. To demonstrate scalability, we applied INSCT to integrate more than 2.6 million transcriptomes from four independent studies of mouse brains in less than 1.5 h using less than 25 GB of memory. This scalability allows researchers to perform atlas-scale data integration on a typical desktop computer. INSCT is freely available at https://github.com/lkmklsmn/insct.

Single-cell RNA sequencing efforts have made large amounts of data available for transcriptomics research. Simon and colleagues develop a neural network embedding approach that avoids batch effects, such that it can rapidly and efficiently integrate large datasets from different studies.
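The abstract does not spell out how triplets are built, so the following is only a minimal sketch of one plausible reading of "batch-aware" triplets, not the INSCT implementation. It assumes that an anchor cell's positive is its nearest cell from a different batch and that its negative is a random cell from the anchor's own batch. The function names `triplet_margin_loss` and `sample_batch_aware_triplets`, the squared Euclidean distance, and the margin value are all hypothetical choices made for illustration.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss over triplets; each argument is an (n, d) array of embeddings."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive squared distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative squared distance
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def sample_batch_aware_triplets(X, batch, rng):
    """Return an (m, 3) integer array of (anchor, positive, negative) row indices.

    Assumption for illustration only: the positive is the nearest cell in a
    different batch, the negative is a random other cell from the anchor's batch.
    """
    n = X.shape[0]
    # Dense pairwise squared distances; fine for a toy example, not for millions of cells.
    dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    idx = np.arange(n)
    triplets = []
    for i in range(n):
        other_batch = np.flatnonzero(batch != batch[i])
        same_batch = np.flatnonzero((batch == batch[i]) & (idx != i))
        if other_batch.size == 0 or same_batch.size == 0:
            continue  # no valid triplet for this anchor
        pos = other_batch[np.argmin(dist[i, other_batch])]
        neg = rng.choice(same_batch)
        triplets.append((i, int(pos), int(neg)))
    return np.array(triplets, dtype=int).reshape(-1, 3)


# Toy usage: two batches of 2-D "cells", with the second batch shifted slightly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 0.5])
batch = np.repeat([0, 1], 20)
triplets = sample_batch_aware_triplets(X, batch, rng)
loss = triplet_margin_loss(X[triplets[:, 0]], X[triplets[:, 1]], X[triplets[:, 2]])
```

In a trained model the loss would be evaluated on the network's embeddings rather than on the raw inputs, and the network weights would be updated to reduce it. Pulling cross-batch neighbours together while pushing same-batch cells apart is what would discourage the embedding from separating cells by batch under these assumptions.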
