Alevin: An integrated method for dscRNA-seq quantification

We introduce alevin, an efficient pipeline for gene quantification from dscRNA-seq (droplet-based single-cell RNA-seq) data. Alevin is an end-to-end quantification pipeline that starts from sample-demultiplexed FASTQ files and generates gene-level counts for two popular droplet-based sequencing protocols (Drop-seq [1] and 10x Chromium [2]). Importantly, alevin handles all processing internally, avoiding both reliance on external pipeline programs and the need to write large intermediate files to disk. Alevin adopts efficient algorithms for cellular-barcode whitelist generation, cellular-barcode correction, and lightweight per-cell UMI deduplication and quantification. This integrated solution allows alevin to process data much faster (typically ~10 times faster) than other approaches, while also working within a reasonable memory budget. For example, alevin completes a full, end-to-end analysis of a single-cell human experiment consisting of ~4500 cells and 335 million reads in 27 minutes, using 13 GB of RAM and 8 threads (of an Intel Xeon E5-2699 v4 CPU).
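To make the stages named above concrete, the following is a minimal Python sketch of cellular-barcode correction followed by per-cell UMI deduplication. It is not alevin's implementation: the function names, the one-mismatch correction rule, and the "count distinct UMIs per (cell, gene)" rule are all simplifying assumptions chosen for illustration, and each read is assumed to arrive already mapped to a single gene.

```python
from collections import defaultdict


def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence at Hamming distance exactly 1 from `barcode`."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_barcode(barcode, whitelist):
    """Map `barcode` to a whitelisted barcode, or None if unmatched or ambiguous.

    Assumption: a barcode is rescued only if exactly one whitelisted
    barcode lies within one mismatch of it.
    """
    if barcode in whitelist:
        return barcode
    hits = {n for n in hamming_neighbors(barcode) if n in whitelist}
    return hits.pop() if len(hits) == 1 else None


def count_genes(reads, whitelist):
    """Count distinct UMIs per (cell, gene).

    `reads` is an iterable of (barcode, umi, gene) tuples; the result is
    a dict of the form {cell_barcode: {gene: n_unique_umis}}.
    """
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        cell = correct_barcode(barcode, whitelist)
        if cell is not None:  # drop reads whose barcode cannot be rescued
            umis[(cell, gene)].add(umi)
    counts = defaultdict(dict)
    for (cell, gene), seen in umis.items():
        counts[cell][gene] = len(seen)
    return dict(counts)


if __name__ == "__main__":
    whitelist = {"AAAA", "CCCC"}
    reads = [
        ("AAAA", "GT", "g1"),
        ("AAAT", "GT", "g1"),  # one mismatch from AAAA, duplicate UMI
        ("AAAA", "CA", "g1"),
        ("CCCC", "GT", "g2"),
        ("GGGG", "GT", "g2"),  # no whitelisted neighbor, discarded
    ]
    print(count_genes(reads, whitelist))
```

In this toy input, the read with barcode `AAAT` is reassigned to cell `AAAA` and its UMI collapses with the identical one already seen there, while the read with barcode `GGGG` is discarded. A real pipeline would also have to derive the whitelist from the data and handle reads that map to several genes, which this sketch leaves out.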

[1] S. Mirarab, et al. Sequence Analysis, 2020, Encyclopedia of Bioinformatics and Computational Biology.

[2] Brent S. Pedersen, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences, 2018, Nature Methods.

[3] Viktor Petukhov, et al. dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments, 2018, Genome Biology.

[4] Matthew D. Young, et al. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data, 2018, bioRxiv.

[5] Aaron T. L. Lun, et al. Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data, 2018.

[6] Lu Zhao, et al. Bartender: a fast and accurate clustering algorithm to count barcode reads, 2018, Bioinform.

[7] Hannah A. Pliner, et al. Reversed graph embedding resolves complex single-cell trajectories, 2017, Nature Methods.

[8] Wei Wang, et al. Fleximer: Accurate Quantification of RNA-Seq via Variable-Length k-mers, 2017, BCB.

[9] Lior Pachter, et al. Barcode identification for single cell genomics, 2017, BMC Bioinformatics.

[10] Valentine Svensson, et al. Power Analysis of Single Cell RNA-Sequencing Experiments, 2016, Nature Methods.

[11] Andreas Heger, et al. UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, 2016, bioRxiv.

[12] Lior Pachter, et al. Near-optimal probabilistic RNA-seq quantification, 2016, Nature Biotechnology.

[13] Lior Pachter, et al. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts, 2016, Genome Biology.

[14] Robert Patro, et al. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes, 2015, bioRxiv.

[15] Allon M. Klein, et al. Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells, 2015, Cell.

[16] Evan Z. Macosko, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, 2015, Cell.

[17] A. Regev, et al. Spatial reconstruction of single-cell gene expression, 2015, Nature Biotechnology.

[18] S. Linnarsson, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing, 2014, Nature Neuroscience.

[19] Wei Wang, et al. RNA-Skim: a rapid method for RNA-Seq quantification at transcript level, 2014, Bioinform.

[20] Gioele La Manno, et al. Quantitative single-cell RNA-seq with unique molecular identifiers, 2013, Nature Methods.

[21] Rob Patro, et al. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms, 2013, Nature Biotechnology.

[22] Wei Shi, et al. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features, 2013, Bioinform.

[23] Orion J. Buske, et al. iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data, 2013, Genome Research.

[24] L. Coin, et al. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads, 2011, Genome Biology.

[25] B. Korte, et al. Combinatorial Optimization: Theory and Algorithms, 2007.

[26] Mohit Singh, et al. Approximation Algorithms, 1997, Encyclopedia of Algorithms.

[27] J. Köster, et al. Snakemake - a scalable bioinformatics workflow engine, 2018, Bioinform.

[28] Tao Zhang, et al. An Information Flow Analysis of a Distributed Information System for Space Medical Support, 2004, MedInfo.