MaRe: Processing Big Data with application containers on Apache Spark

Abstract

Background: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks lack native support for application containers, which are becoming popular in scientific data processing.

Results: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and the container engine that have attracted the largest open source communities; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.

Conclusions: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. Compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
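The general idea of reusing an existing command-line tool inside a MapReduce computation can be sketched in plain Python. This is only an illustration under stated assumptions, not MaRe's API: the partitions are in-memory strings, the "tool" is a hypothetical stand-in (a small Python one-liner that counts G and C characters on standard input) rather than a Docker container, and the names `TOOL`, `run_tool`, and `map_reduce` are invented for this sketch.

```python
import subprocess
import sys
from functools import reduce

# Hypothetical stand-in for a containerized tool (an assumption, not part of
# MaRe): reads text on stdin and prints the number of G and C characters.
TOOL = [
    sys.executable,
    "-c",
    "import sys; d = sys.stdin.read(); print(d.count('G') + d.count('C'))",
]


def run_tool(partition: str) -> int:
    """Map step: pipe one data partition through the external command."""
    result = subprocess.run(
        TOOL, input=partition, capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def map_reduce(partitions):
    """Run the tool on every partition, then sum the partial counts."""
    return reduce(lambda a, b: a + b, map(run_tool, partitions), 0)


# Three toy partitions of a nucleotide sequence.
partitions = ["ATGC", "GGCA", "TTAA"]
total = map_reduce(partitions)
```

In a distributed setting, each call to `run_tool` would run on the node that already holds the partition, which is what gives the data-locality advantage mentioned in the abstract; here everything runs sequentially on one machine.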
