Kafka interfaces for composable streaming genomics pipelines

Modern sequencing machines produce on the order of a terabyte of data per day, which must then go through a complex processing pipeline. The standard workflow begins with a few independent, shared-memory tools that communicate through intermediate files. As the volume of data keeps growing, this approach is becoming increasingly unmanageable because it lacks robustness and scalability. In this work we propose adopting stream computing to simplify the genomic pipeline, boost its performance, and improve its fault tolerance. We decompose the first steps of genomic processing into two distinct, specialized modules (preprocessing and alignment) and loosely couple them through Kafka streams, which allows easy composition and integration with existing Hadoop-based pipelines. The proposed solution is experimentally validated on real data and shown to scale almost linearly.
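The loose coupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Topic` class is a hypothetical in-memory stand-in for a Kafka topic (an append-only log with per-consumer-group offsets), the filtering rule in `preprocess` is an assumed example of preprocessing, and exact substring search stands in for a real aligner. The point is only that the two stages share nothing except the topic they write to and read from.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Topic:
    """In-memory stand-in for a Kafka topic (assumption, not the Kafka API):
    an append-only log plus one read offset per consumer group."""
    log: list = field(default_factory=list)
    offsets: dict = field(default_factory=lambda: defaultdict(int))

    def produce(self, record):
        self.log.append(record)

    def consume(self, group):
        """Return records this group has not seen yet and advance its offset."""
        start = self.offsets[group]
        self.offsets[group] = len(self.log)
        return self.log[start:]


def preprocess(raw_reads, out_topic):
    """Hypothetical preprocessing stage: drop reads containing uncalled
    bases ('N') and publish the remaining reads."""
    for read_id, seq in raw_reads:
        if "N" not in seq:
            out_topic.produce((read_id, seq))


def align(in_topic, out_topic, reference, group="aligner"):
    """Toy alignment stage: exact substring search stands in for a real
    aligner. Publishes (read_id, position), with -1 meaning no match."""
    for read_id, seq in in_topic.consume(group):
        out_topic.produce((read_id, reference.find(seq)))


reference = "ACGTACGTTAGC"
reads_topic, aligned_topic = Topic(), Topic()

# The stages communicate only through topics, never through each other.
preprocess([("r1", "CGTT"), ("r2", "ANGT"), ("r3", "TAGC")], reads_topic)
align(reads_topic, aligned_topic, reference)

results = aligned_topic.consume("sink")
print(results)
```

Because each consumer group tracks its own offset, a second downstream tool could read `reads_topic` under a different group name without affecting the aligner, which is the composability property the abstract refers to. In a real deployment the log would be held by Kafka brokers rather than in process memory.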
