Succinct Colored de Bruijn Graphs

Iqbal et al. (Nature Genetics, 2012) introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph aimed at “detecting and genotyping simple and complex genetic variants in an individual or population”. Because these graphs are intended for massive population-level data, they must be represented efficiently. Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations. Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible.
