Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform

MOTIVATION: Low-cost genome sequencing gives unprecedented complete information about the genetic structure of populations, and a population graph captures the variations between many individuals of a population. Recently, Marcus et al. proposed using a compressed de Bruijn graph to represent an entire population of genomes. They devised an O(n log g) time algorithm called splitMEM that constructs this graph directly (i.e. without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Since the applicability of their algorithm is limited to rather small datasets, there is a strong need for space-efficient construction algorithms.

RESULTS: We present two algorithms that outperform splitMEM in theory and in practice. The first implements a novel linear-time suffix tree algorithm by means of a compressed suffix tree. The second algorithm uses the Burrows-Wheeler transform to build the compressed de Bruijn graph in O(n log σ) time, where σ is the size of the alphabet. To demonstrate the scalability of the algorithms, we applied them to seven human genomes.

AVAILABILITY AND IMPLEMENTATION: https://www.uni-ulm.de/in/theo/research/seqana/.
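To make the object under construction concrete, the following is a minimal sketch of what a compressed de Bruijn graph looks like for a handful of short sequences. It is a naive k-mer enumeration followed by merging of non-branching chains, not splitMEM and not either of the algorithms presented here, and it makes no attempt at space efficiency. The function name `compressed_de_bruijn_nodes` and the toy inputs are assumptions chosen for illustration.

```python
from collections import defaultdict


def compressed_de_bruijn_nodes(sequences, k):
    """Return the sorted node labels of a naively built compressed de Bruijn graph."""
    succ = defaultdict(set)  # k-mer -> k-mers that directly follow it in some sequence
    pred = defaultdict(set)  # k-mer -> k-mers that directly precede it in some sequence
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmers.add(kmer)
            if i + k < len(seq):
                nxt = seq[i + 1:i + k + 1]
                succ[kmer].add(nxt)
                pred[nxt].add(kmer)

    def starts_node(kmer):
        # A k-mer opens a compressed node unless it has exactly one
        # predecessor and is that predecessor's only successor.
        preds = pred.get(kmer, set())
        if len(preds) != 1:
            return True
        (p,) = preds
        return len(succ[p]) != 1

    visited = set()

    def walk(start):
        # Extend the label along a non-branching chain of k-mers.
        label, cur = start, start
        visited.add(cur)
        while len(succ.get(cur, ())) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            label += nxt[-1]
            cur = nxt
            visited.add(cur)
        return label

    labels = [walk(m) for m in sorted(kmers) if starts_node(m)]
    # Any k-mers left over lie on isolated cycles without a branching point.
    labels += [walk(m) for m in sorted(kmers) if m not in visited]
    return sorted(labels)


if __name__ == "__main__":
    # Two toy "genomes" that differ by a single substitution.
    print(compressed_de_bruijn_nodes(["AACGTCC", "AACTTCC"], 3))
```

In this toy input the two sequences share a prefix and a suffix and differ in one position, so the compressed graph has a shared start node, two alternative branch nodes and a shared end node. Enumerating all k-mers explicitly, as done here, is exactly the uncompressed intermediate step that direct construction avoids.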

[1] P. Gajer, et al. The Pangenome Structure of Escherichia coli: Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates, 2008, Journal of Bacteriology.

[2] Enno Ohlebusch, et al. Space-Efficient Construction of the Burrows-Wheeler Transform, 2013, SPIRE.

[3] Gil McVean, et al. Improved genome inference in the MHC using a population reference graph, 2014, Nature Genetics.

[4] Michael C. Schatz, et al. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips, 2014, Bioinform.

[5] N. Warthmann, et al. Simultaneous alignment of short reads against multiple genomes, 2009, Genome Biology.

[6] Enno Ohlebusch, et al. Computing the longest common prefix array based on the Burrows-Wheeler transform, 2011, J. Discrete Algorithms.

[7] Knut Reinert, et al. Journaled string tree - a scalable data structure for analyzing thousands of similar genomes on your laptop, 2014, Bioinform.

[8] Enno Ohlebusch, et al. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, 2013.

[9] Jens Stoye, et al. Bloom Filter Trie - A Data Structure for Pan-Genome Storage, 2015, WABI.

[10] Thierry Lecroq, et al. From Indexing Data Structures to de Bruijn Graphs, 2014, CPM.

[11] Jaideep P. Sundaram, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome", 2005, Proceedings of the National Academy of Sciences of the United States of America.

[12] Carl Kingsford, et al. Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, 2015, bioRxiv.

[13] Adam M. Novak, et al. Mapping to a Reference Genome Structure, 2014, 1404.5010.

[14] 김동규, et al. [Book review] "Algorithms on Strings, Trees, and Sequences", 2000.

[15] G. McVean, et al. De novo assembly and genotyping of variants using colored de Bruijn graphs, 2011, Nature Genetics.

[16] Enno Ohlebusch, et al. Efficient Construction of a Compressed de Bruijn Graph for Pan-Genome Analysis, 2015, CPM.

[17] Gonzalo Navarro, et al. Faster Compressed Suffix Trees for Repetitive Text Collections, 2014, SEA.

[18] Alistair Moffat, et al. From Theory to Practice: Plug and Play with Succinct Data Structures, 2013, SEA.

[19] M. Gerstein, et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework, 2011, Molecular Systems Biology.