Fully-sensitive seed finding in sequence graphs using a hybrid index

Motivation: Sequence graphs are versatile data structures that can, for instance, represent the genetic variation found in a population and facilitate genome assembly. Read mapping to sequence graphs is an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, which loses information, especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that extant methods do not exploit.

Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project data set. On this graph, PSI outperforms GCSA2 in terms of index size, query time, and sensitivity.

Availability: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
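The hybrid idea summarized in the abstract can be illustrated with a small Python sketch. It is not the PSI implementation: the toy graph, the function names, and the use of plain dictionaries as k-mer indexes are all assumptions made for illustration. Seeds that lie on a selected path are found through an index over the paths; every other k-length walk in the graph is enumerated and looked up in an index over the reads, so no part of the graph is pruned.

```python
from collections import defaultdict

# Hypothetical toy graph: node id -> (label, successor ids).
# It encodes a SNP bubble: ACG -> (T | G) -> CA
GRAPH = {
    1: ("ACG", [2, 3]),
    2: ("T", [4]),
    3: ("G", [4]),
    4: ("CA", []),
}

def kmer_index(seqs, k):
    """Map each k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((sid, i))
    return index

def path_sequence(graph, path):
    """Spell a path and record the (node, offset) of every character."""
    seq, coords = "", []
    for node in path:
        label = graph[node][0]
        seq += label
        coords.extend((node, off) for off in range(len(label)))
    return seq, coords

def graph_walks(graph, k):
    """Yield (start node, start offset, k-mer, visited nodes) for all k-length walks."""
    def extend(node, offset, prefix, walk):
        label, successors = graph[node]
        prefix += label[offset:offset + k - len(prefix)]
        walk = walk + (node,)
        if len(prefix) == k:
            yield prefix, walk
        else:
            for nxt in successors:
                yield from extend(nxt, 0, prefix, walk)
    for node, (label, _) in graph.items():
        for offset in range(len(label)):
            for kmer, walk in extend(node, offset, "", ()):
                yield node, offset, kmer, walk

def on_some_path(walk, paths):
    """True if the node walk is a contiguous stretch of a selected path."""
    n = len(walk)
    return any(tuple(p[i:i + n]) == walk
               for p in paths for i in range(len(p) - n + 1))

def find_seeds(graph, paths, reads, k):
    """Return all (read id, read offset, node, node offset) exact k-mer hits."""
    hits = set()
    # Part 1: seeds on the selected paths, found via an index over the paths.
    path_seqs, path_coords = {}, {}
    for pid, path in enumerate(paths):
        path_seqs[pid], path_coords[pid] = path_sequence(graph, path)
    path_idx = kmer_index(path_seqs, k)
    for rid, read in reads.items():
        for i in range(len(read) - k + 1):
            for pid, pos in path_idx.get(read[i:i + k], ()):
                node, off = path_coords[pid][pos]
                hits.add((rid, i, node, off))
    # Part 2: walks not covered by any path, looked up in an index over the reads.
    read_idx = kmer_index(reads, k)
    for node, off, kmer, walk in graph_walks(graph, k):
        if on_some_path(walk, paths):
            continue
        for rid, i in read_idx.get(kmer, ()):
            hits.add((rid, i, node, off))
    return hits

# One read follows the selected (reference) path, the other the alternative allele.
READS = {"r1": "CGGC", "r2": "GTCA"}
hits = find_seeds(GRAPH, [[1, 2, 4]], READS, 3)
for hit in sorted(hits):
    print(hit)
```

In this sketch the read index is what keeps the traversal cheap to query: however many walks the graph contains, only those matching a k-mer actually present in the reads produce hits. A practical implementation would use compressed indexes and bound the traversal far more carefully than this exhaustive enumeration does.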
