Single-cell Transcriptome Study as Big Data

The rapid growth of single-cell RNA-seq (scRNA-seq) studies demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. This article discusses strategies for handling the stochastic and heterogeneous signal of single-cell transcriptomes. After extensively reviewing the available big-data applications in next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and the primary objectives of single-cell studies.
