Read mapping on de Bruijn graphs

Background: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Practical tools to perform mapping on such graphs are currently lacking.

Results: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the complexity of the problem, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set.

Conclusions: Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping, even for complex eukaryotic data.
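To make the distinction between mapping on contigs and mapping on graph paths concrete, the following minimal sketch builds a toy node-centric de Bruijn graph and tests whether a read spells a path in it. It is an illustration under strong simplifying assumptions, not the BGREAT heuristic: matching is exact (no mismatches), reverse complements are ignored, and the function names, the value k = 4, the two input sequences and the hand-listed unitigs are all hypothetical choices made for this example.

```python
def kmers(seq, k):
    """Return the list of k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dbg_nodes(sequences, k):
    """Node set of a node-centric de Bruijn graph.

    An edge u -> v exists implicitly whenever u[1:] == v[:-1], so the
    node set alone determines the graph.
    """
    return {km for s in sequences for km in kmers(s, k)}


def maps_on_graph(read, nodes, k):
    """Exact mapping: the read spells a path iff all its k-mers are nodes.

    Consecutive k-mers of a read always overlap by k-1 characters, so
    they are automatically joined by an edge when both are nodes.
    """
    return len(read) >= k and all(km in nodes for km in kmers(read, k))


def maps_on_contigs(read, contigs):
    """Exact mapping on a contig set: the read is a substring of a contig."""
    return any(read in c for c in contigs)


K = 4  # hypothetical k-mer size for the toy example

# Two toy sequences sharing the segment CGTTG, which creates branching nodes.
nodes = build_dbg_nodes(["AACGTTGCC", "GGCGTTGAA"], K)

# Unitigs (non-branching paths) of this toy graph, listed by hand.
contigs = ["AACGT", "GGCGT", "CGTTG", "TTGCC", "TTGAA"]

read = "ACGTTGC"  # spans two branching nodes
on_graph = maps_on_graph(read, nodes, K)
on_contigs = maps_on_contigs(read, contigs)
```

In this toy case the read crosses the shared segment, so it follows a path through several unitigs without being contained in any single one of them. That is the situation in which mapping on the graph recovers a read that mapping on the contig set misses.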
