Scalable genomics: from raw data to aligned reads on Apache YARN

The adoption of Big Data technologies can potentially boost the scalability of data-driven biology and health workflows by orders of magnitude. Consider, for instance, that technologies in the Hadoop ecosystem have been used successfully in data-driven industries to scale processes to levels much larger than any biology- or health-driven work attempted thus far. In this work we demonstrate the scalability of a sequence alignment pipeline based on technologies from the Hadoop ecosystem, namely Apache Flink and Hadoop MapReduce, both running on the distributed Apache YARN platform. Unlike previous work, our pipeline starts processing directly from the raw BCL data produced by Illumina sequencers. A Flink-based distributed algorithm reconstructs reads from the Illumina BCL data and then demultiplexes them, analogously to the bcl2fastq2 program provided by Illumina. Subsequently, the BWA-MEM-based distributed aligner from the Seal project performs read mapping on the YARN platform. While the standard programs, Illumina's bcl2fastq2 and BWA-MEM, are limited to shared-memory parallelism (multi-threading), our solution is completely distributed and can scale across a large number of computing nodes. Results show excellent pipeline scalability, linear in the number of nodes. In addition, this approach automatically benefits from the robustness to hardware failure and transient cluster problems provided by the YARN platform, as well as from the scalability of the Hadoop Distributed File System. Moreover, this YARN-based approach complements the upcoming version 4 of GATK, which is based on Spark and can therefore run on YARN. Together, they can form a complete, scalable, YARN-based variant calling pipeline for Illumina data. Such a pipeline will be further improved by the arrival of distributed in-memory filesystem technology such as Apache Arrow, which removes the need to write intermediate data to disk.
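To make the read reconstruction and demultiplexing steps concrete, the following is a minimal single-machine sketch in Python. It is not the Flink-based implementation described above. It assumes the classic BCL byte layout, in which the two low bits of each byte encode the base and the upper six bits encode the quality, with a zero byte meaning "no call". It also assumes that file headers and compression have already been stripped, that the index (barcode) occupies the last cycles of each read, and that barcodes must match exactly. The function names and the `sample_by_barcode` mapping are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

BASES = "ACGT"


def decode_bcl_byte(b):
    """Decode one BCL byte into a (base, quality) pair."""
    if b == 0:
        # A zero byte marks a no-call.
        return "N", 0
    return BASES[b & 0b11], b >> 2


def reconstruct_reads(cycles):
    """Rebuild reads from per-cycle data.

    `cycles` is a list of bytes objects, one per sequencing cycle, each
    holding one byte per cluster (header already removed; an assumption).
    Cluster i's read is the i-th byte of every cycle, in cycle order.
    """
    n_clusters = len(cycles[0])
    reads = []
    for i in range(n_clusters):
        calls = [decode_bcl_byte(cycle[i]) for cycle in cycles]
        seq = "".join(base for base, _ in calls)
        quals = [qual for _, qual in calls]
        reads.append((seq, quals))
    return reads


def demultiplex(reads, index_len, sample_by_barcode):
    """Group reads by sample using an exact barcode match.

    Assumes (for illustration only) that the barcode is the last
    `index_len` bases of each reconstructed read.
    """
    by_sample = defaultdict(list)
    for seq, quals in reads:
        barcode = seq[-index_len:]
        sample = sample_by_barcode.get(barcode, "Undetermined")
        by_sample[sample].append((seq[:-index_len], quals[:-index_len]))
    return dict(by_sample)


if __name__ == "__main__":
    # Two clusters, three cycles: two template cycles plus one index cycle.
    cycles = [bytes([120, 83]), bytes([121, 0]), bytes([122, 83])]
    reads = reconstruct_reads(cycles)
    print(demultiplex(reads, 1, {"G": "sampleA"}))
```

The per-cluster loop is effectively a transpose of per-cycle data into per-read data. In a distributed setting, that transpose is the part that would have to be partitioned across nodes, for example by tile. Production demultiplexers such as bcl2fastq2 also tolerate barcode mismatches, which this sketch omits.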
Original article: This paper was presented at the IEEE International Conference on Big Data, 2016, and is available at https://doi.org/10.1109/BigData.2016.7840727
