Detecting high-scoring local alignments in pangenome graphs

Abstract

Motivation: The increasing number of individual genomes sequenced per species motivates the use of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet.

Results: We present a new heuristic method to find maximum-scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are demonstrated. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.

Availability and implementation: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast.

Supplementary information: Supplementary data are available at Bioinformatics online.
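To illustrate what an exponential-tail score distribution provides in practice, the following minimal sketch computes significance values in the classical Karlin-Altschul form used for BLAST, E = K * m * n * exp(-lambda * S). It is not taken from the paper's implementation: the function names, the use of m * n as the search-space size, and the parameter values in the usage note are assumptions chosen for illustration, and the search space for a pangenome graph may be defined differently.

```python
import math


def evalue(score, m, n, lam, K):
    """Expected number of chance alignments scoring at least `score`.

    Classical Karlin-Altschul form: E = K * m * n * exp(-lam * score).
    m and n are the query and target lengths (assumed search-space size);
    lam and K are the distribution parameters, which must be estimated.
    """
    return K * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, K):
    """Probability of at least one chance alignment scoring >= `score`.

    P = 1 - exp(-E), computed with expm1 for accuracy when E is small.
    """
    return -math.expm1(-evalue(score, m, n, lam, K))


def is_significant(score, m, n, lam, K, max_evalue=1e-3):
    """Keep an alignment only if its E-value is below a chosen cutoff."""
    return evalue(score, m, n, lam, K) <= max_evalue
```

With hypothetical parameters such as lam = 1.0 and K = 0.5, a call like `is_significant(40, 1000, 5_000_000, 1.0, 0.5)` would separate a high-scoring alignment from spurious hits; the appropriate parameter values depend on the scoring scheme and have to be estimated as the abstract describes.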
