SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools that can rapidly process such large quantities of data are critical to the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad hoc manner and require extensive programmer effort to develop and optimize. Often, they are written with currently available data sizes in mind and soon start to underperform as the data grows exponentially. Furthermore, to obtain high performance, these tools require parallel implementations, which adds to the development complexity. This paper observes that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of these kernels can improve programmer productivity and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports the exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications (BLAST, MUMmer, E-MEM, SPAdes, and SGA) using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show speedups of 2.4X, 2.5X, and 2.1X, respectively, over hand-optimized implementations when run on a single node, while SPAdes and SGA match the performance of hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.
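
The abstract does not enumerate the kernels, so the following Python sketch is only an illustration of what a recurring module might look like. It assumes that k-mer extraction, index construction, and index lookup are typical building blocks shared by sequence-matching tools. The function names (`kmerize`, `build_index`, `find_seeds`) are hypothetical and are not Sarvavid syntax.

```python
from collections import defaultdict


def kmerize(seq, k):
    """Yield (offset, k-mer) pairs for every length-k window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(reference, k):
    """Map each k-mer of the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for pos, kmer in kmerize(reference, k):
        index[kmer].append(pos)
    return index


def find_seeds(query, index, k):
    """Return (query_offset, reference_offset) pairs for every shared k-mer."""
    seeds = []
    for qpos, kmer in kmerize(query, k):
        for rpos in index.get(kmer, ()):
            seeds.append((qpos, rpos))
    return seeds


if __name__ == "__main__":
    k = 4
    index = build_index("ACGTACGT", k)
    print(find_seeds("TTACGT", index, k))
```

Composing an application from small, reusable steps like these is the kind of structure that a compiler could, in principle, analyze and optimize as a whole, which a collection of independent library calls does not readily allow.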
