Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

Abstract

Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient as the number of files grows because of excessive computation time and input/output bottlenecks. Distributed systems, and more recently cloud-based systems, offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy, splitting the merging job into sequential phases/stages consisting of subtasks that are completed in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED file or a single VCF file, and benchmark them against traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or exhibit much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
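To make the operation concrete, the following is a minimal single-machine sketch of a sorted multiway merge, the kind of traditional baseline the abstract refers to. It is not the authors' Hadoop, HBase, or Spark schema. The record layout `(chromosome_rank, position, sample_id, genotype)`, the function names, and the `./.` placeholder for a missing call are assumptions chosen for illustration; real VCF parsing is omitted.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical simplified record: (chromosome_rank, position, sample_id, genotype).
Record = Tuple[int, int, str, str]


def multiway_merge(streams: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge per-sample streams that are each already sorted by locus."""
    return heapq.merge(*streams, key=lambda r: (r[0], r[1]))


def group_by_locus(
    merged: Iterable[Record], samples: List[str], missing: str = "./."
) -> Iterator[tuple]:
    """Collapse a locus-sorted stream into one row per locus,
    with one genotype column per sample (in the order given by `samples`)."""
    row_key: Optional[Tuple[int, int]] = None
    calls: Dict[str, str] = {}
    for chrom, pos, sample, genotype in merged:
        if (chrom, pos) != row_key:
            # A new locus starts: emit the finished row, if any.
            if row_key is not None:
                yield row_key + tuple(calls.get(s, missing) for s in samples)
            row_key, calls = (chrom, pos), {}
        calls[sample] = genotype
    if row_key is not None:
        yield row_key + tuple(calls.get(s, missing) for s in samples)


# Two toy per-sample inputs, each sorted by (chromosome_rank, position).
sample_a = [(1, 100, "A", "0/1"), (1, 300, "A", "1/1"), (2, 50, "A", "0/0")]
sample_b = [(1, 100, "B", "0/0"), (1, 200, "B", "0/1")]

rows = list(group_by_locus(multiway_merge([sample_a, sample_b]), ["A", "B"]))
for row in rows:
    print(row)
```

Because `heapq.merge` keeps only one pending record per input stream, memory use grows with the number of files rather than their size; the single consumer, however, is the kind of sequential bottleneck that motivates distributing the work.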
