Anchor points for genome alignment based on Filtered Spaced Word Matches

Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as {\em anchor points} to reduce the search space of the alignment procedure. The speed and quality of these methods therefore depend on the underlying anchor points. Herein, we propose to use {\em Filtered Spaced Word Matches} to calculate anchor points for genome alignment. To evaluate this approach, we used these anchor points in the widely used alignment pipeline {\em Mugsy}. For distantly related sequence sets, this substantially improved the quality of the alignments produced by {\em Mugsy}.
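
The abstract does not spell out how a filtered spaced-word match is computed, so the following is only a minimal sketch of the general idea under stated assumptions. A binary pattern marks match positions ('1') and don't-care positions ('0'); two windows form a spaced-word match if they agree at all match positions; the match is then scored over the don't-care positions and kept as a candidate anchor only if the score exceeds a threshold. The function names, the toy pattern, the +1/-1 scoring and the threshold of 0 are all hypothetical choices made for illustration, and a real implementation would more likely use a nucleotide substitution matrix and a much longer pattern.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Yield (start, word) for every window, keeping only the '1' positions."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield start, "".join(seq[start + i] for i in match_pos)


def filtered_matches(seq_a, seq_b, pattern, threshold=0, match=1, mismatch=-1):
    """Return (pos_a, pos_b, score) for spaced-word matches scoring above threshold.

    The +1/-1 scoring is an assumption made for this sketch only.
    """
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]

    # Index all spaced words of seq_b by their match-position characters.
    index = defaultdict(list)
    for pos_b, word in spaced_words(seq_b, pattern):
        index[word].append(pos_b)

    anchors = []
    for pos_a, word in spaced_words(seq_a, pattern):
        for pos_b in index.get(word, []):
            # Score the candidate using the don't-care positions only.
            score = sum(
                match if seq_a[pos_a + i] == seq_b[pos_b + i] else mismatch
                for i in dont_care
            )
            if score > threshold:  # filtering step: drop low-scoring matches
                anchors.append((pos_a, pos_b, score))
    return anchors


if __name__ == "__main__":
    # Toy input; real patterns and sequences would be far longer.
    print(filtered_matches("ACGTACGT", "ACGAACGT", "1101"))
```

The surviving position pairs are what an alignment pipeline could consume as candidate anchor points; how they are chained and passed on to {\em Mugsy} is not covered by this sketch.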