Scalable genomics: from raw data to aligned reads on Apache YARN

The adoption of Big Data technologies can potentially boost the scalability of data-driven biology and health workflows by orders of magnitude. Consider, for instance, that technologies in the Hadoop ecosystem have been successfully used in data-driven industry to scale processes to levels much larger than any biology- or health-driven work attempted thus far. In this work we demonstrate the scalability of a sequence alignment pipeline based on technologies from the Hadoop ecosystem, namely Apache Flink and Hadoop MapReduce, both running on the distributed Apache YARN platform. Unlike previous work, our pipeline starts processing directly from the raw BCL data produced by Illumina sequencers. A Flink-based distributed algorithm reconstructs reads from the Illumina BCL data and then demultiplexes them, analogously to the bcl2fastq2 program provided by Illumina. Subsequently, the BWA-MEM-based distributed aligner from the Seal project is used to perform read mapping on the YARN platform. While the standard programs by Illumina and BWA-MEM are limited to shared-memory parallelism (multi-threading), our solution is completely distributed and can scale across a large number of computing nodes. Results show excellent pipeline scalability, linear in the number of nodes. In addition, this approach automatically benefits from the robustness to hardware failure and transient cluster problems provided by the YARN platform, as well as from the scalability of the Hadoop Distributed File System. Moreover, this YARN-based approach complements the upcoming version 4 of the GATK toolkit, which is based on Spark and can therefore run on YARN. Together, they can be used to form a complete, scalable, YARN-based variant calling pipeline for Illumina data, which will be further improved with the arrival of distributed in-memory filesystem technology such as Apache Arrow, thus removing the need to write intermediate data to disk.
