AStarix: Fast and Optimal Sequence-to-Graph Alignment

We present an algorithm for the optimal alignment of sequences to genome graphs. It phrases the task of minimizing edit distance as finding a shortest path in an implicit alignment graph. To find that path, we instantiate the A\(^\star\) paradigm with a novel domain-specific heuristic function that accounts for the upcoming subsequence of the query still to be aligned, resulting in a provably optimal alignment algorithm called AStarix.
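The shortest-path formulation can be illustrated with a minimal sketch. It is not the AStarix implementation: the graph encoding (a dict of labeled edges), the function name `align`, and the unit edit costs are assumptions made for illustration. The heuristic is a pluggable parameter that defaults to zero, in which case the search reduces to Dijkstra's algorithm. The paper's domain-specific heuristic is not reproduced here.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical encoding: graph[u] lists (v, letter) for each edge u -> v.
Graph = Dict[int, List[Tuple[int, str]]]


def align(graph: Graph, query: str,
          heuristic: Callable[[int, int], int] = lambda u, i: 0) -> Optional[int]:
    """Minimum edit distance between `query` and any path in `graph`.

    A search state (u, i) means: we stand at node u and have aligned the
    first i query letters. The alignment graph is never built explicitly;
    successors are generated on demand. `heuristic(u, i)` must never
    overestimate the remaining cost (admissibility) for the result to be
    optimal. The default of zero is trivially admissible.
    """
    n = len(query)
    nodes = set(graph) | {v for outs in graph.values() for v, _ in outs}
    dist: Dict[Tuple[int, int], int] = {}
    heap: List[Tuple[int, int, int, int]] = []
    # The alignment may start at any node, so every (u, 0) is a source.
    for u in nodes:
        dist[(u, 0)] = 0
        heapq.heappush(heap, (heuristic(u, 0), 0, u, 0))

    while heap:
        _, g, u, i = heapq.heappop(heap)
        if g > dist.get((u, i), g):
            continue  # stale queue entry, a cheaper route was found later
        if i == n:
            return g  # whole query aligned: first such pop is optimal
        # Query letter with no graph letter (cost 1).
        successors = [(u, i + 1, 1)]
        for v, letter in graph.get(u, []):
            # Match (cost 0) or substitution (cost 1).
            successors.append((v, i + 1, 0 if letter == query[i] else 1))
            # Graph letter with no query letter (cost 1).
            successors.append((v, i, 1))
        for v, j, cost in successors:
            new_g = g + cost
            if new_g < dist.get((v, j), new_g + 1):
                dist[(v, j)] = new_g
                heapq.heappush(heap, (new_g + heuristic(v, j), new_g, v, j))
    return None  # only reached for an empty graph


# Toy graph: A, then C or T (a variant), then G.
toy = {0: [(1, "A")], 1: [(2, "C"), (2, "T")], 2: [(3, "G")]}
print(align(toy, "ATG"))  # 0: the query is spelled by a path
print(align(toy, "AGG"))  # 1: one substitution is needed
```

With the zero heuristic the search explores states in order of accumulated cost alone. A more informative admissible heuristic, such as the one the abstract describes, would prune states that cannot lead to a cheap alignment while still returning the same optimal cost.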
