Enabling graph appliance for genome assembly

In recent years, the amount of genomic data available as reads generated by genome sequencers has grown enormously. The number of reads generated can range from hundreds to billions of nucleotide sequences, each varying in length. Assembling such large amounts of data is a challenging computational problem for both biomedical and data scientists. Most genome assemblers developed to date use de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences with billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback of de Bruijn graph assembly. Massively parallel, multithreaded, shared-memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability of de Bruijn graph assembly on Cray's Urika-GD system. Urika-GD is a high-performance graph appliance with a large shared memory and a massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no prior research on representing a de Bruijn graph as an RDF graph or on finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach that searches for cycles to find Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished on Urika-GD using iterative SPARQL queries.
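
To make the ideas in the abstract concrete, the following is a minimal Python sketch of the general technique, not the paper's implementation. It builds a de Bruijn graph from reads (one edge per distinct k-mer), serializes the edges as N-Triples-style RDF statements, and reconstructs a sequence by walking an Eulerian path. Several things here are assumptions made for illustration: the base IRI `http://example.org/dbg/`, the `next` predicate, and all function names are hypothetical, and the path search uses an in-memory Hierholzer traversal rather than the iterative SPARQL cycle search the paper proposes for Urika-GD. The sketch also ignores k-mer multiplicity, sequencing errors, and reverse complements.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return one (prefix, suffix) edge per distinct k-mer found in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    # Each k-mer links its (k-1)-length prefix to its (k-1)-length suffix.
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)


def to_triples(edges, base="http://example.org/dbg/"):
    """Serialize edges as N-Triples lines (base IRI and predicate are hypothetical)."""
    return [
        f"<{base}node/{u}> <{base}next> <{base}node/{v}> ."
        for u, v in edges
    ]


def eulerian_path(edges):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the reads' de Bruijn graph."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGCG", "GCGTA"]
    for triple in to_triples(de_bruijn_edges(reads, 3)):
        print(triple)
    print(assemble(reads, 3))
```

In a triple-store setting, the output of `to_triples` is the kind of data that would be loaded into the store, and the traversal done here by `eulerian_path` would instead be driven by repeated queries against it.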
