Challenges in the Setup of Large-scale Next-Generation Sequencing Analysis Workflows

While Next-Generation Sequencing (NGS) can now be considered an established analysis technology for research applications across the life sciences, its analysis workflows still require substantial bioinformatics expertise. Typical challenges include selecting appropriate analysis software, speeding up the overall procedure through high-performance computing (HPC) parallelization and acceleration technology, developing automation strategies and data storage solutions, and, finally, developing methods to fully exploit the analysis results across multiple experimental conditions. Recently, NGS has begun to expand into clinical environments, where it facilitates diagnostics that enable personalized therapeutic approaches but also brings new technological, legal and ethical challenges. There are probably as many overall concepts for analyzing the data as there are academic research institutions. These concepts include, for instance, complex IT architectures developed in-house, ready-to-use technologies installed on-site, and comprehensive Everything as a Service (XaaS) solutions. In this mini-review, we summarize the key points to consider when setting up analysis architectures, mostly for scientific rather than diagnostic purposes, and provide an overview of the current state of the art and the challenges of the field.
