GenAp: a distributed SQL interface for genomic data

Background: The low cost and improved quality of genome sequencing give researchers of genetic diseases, such as cancer, a powerful tool to better understand the underlying genetic mechanisms of those diseases and to treat them with effective targeted therapies. As a result, a number of projects today sequence the DNA of large patient populations, and each project produces at least hundreds of terabytes of data. The challenge now is to provide the produced data on demand to interested parties.

Results: In this paper, we show that this challenge can be met by a modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50× faster than its existing brute-force approach and 8× faster than similar distributed implementations. Thus, Spark SQL can replace existing practices for retrieving genomic data and, as we show, allows users to reduce the number of lines of software code that must be developed to query such data by an order of magnitude.
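To illustrate what a join keyed on genomic intervals involves, the following is a minimal single-machine sketch in Python. It is not the paper's Spark SQL modification. The `(chromosome, start, end)` tuple layout, the closed-interval convention, and the function names `brute_force_join` and `sweep_join` are assumptions made only for this example. The first function compares every pair of records, which is the kind of brute-force evaluation the abstract refers to. The second sorts the records by start position and sweeps once over each chromosome, so it only pairs intervals that are still "open" at the current position.

```python
from collections import defaultdict
from typing import List, Tuple

# Hypothetical record layout: (chromosome, start, end), both ends inclusive.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Two intervals overlap if they share a chromosome and at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def brute_force_join(left: List[Interval], right: List[Interval]):
    """Compare every left record with every right record: O(|left| * |right|)."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


def sweep_join(left: List[Interval], right: List[Interval]):
    """Sort by start position, then sweep each chromosome once.

    While sweeping, each side keeps a list of intervals that have started
    and may still overlap later ones. An arriving interval is paired only
    with the other side's still-active intervals.
    """
    events = defaultdict(list)
    for iv in left:
        events[iv[0]].append((iv[1], 0, iv))   # side 0 = left
    for iv in right:
        events[iv[0]].append((iv[1], 1, iv))   # side 1 = right

    out = []
    for chrom_events in events.values():
        chrom_events.sort(key=lambda e: (e[0], e[1]))
        active = ([], [])  # active[0]: left intervals, active[1]: right intervals
        for start, side, iv in chrom_events:
            other = 1 - side
            # Drop intervals on the other side that ended before this start.
            active[other][:] = [o for o in active[other] if o[2] >= start]
            # Every remaining one started no later than `start` and has not
            # ended yet, so it overlaps the arriving interval.
            for o in active[other]:
                out.append((iv, o) if side == 0 else (o, iv))
            active[side].append(iv)
    return out
```

Both functions return the same set of `(left, right)` pairs. The sweep variant avoids the all-pairs comparison when most intervals do not overlap, which is the typical case when reads or variants are joined against a small set of regions of interest.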
