Space Efficient Computation of Rare Maximal Exact Matches between Multiple Sequences

In this article, we propose a new method for computing rare maximal exact matches between multiple sequences. A rare match between k sequences S(1), ..., S(k) is a string that occurs at most t(i) times in the sequence S(i), where the t(i) > 0 are user-defined thresholds. First, the suffix tree of one of the sequences (the reference sequence) is built, and then each of the other sequences is matched separately against this suffix tree. Second, the resulting pairwise exact matches are combined into multiple exact matches. A clever implementation of this method yields a very fast and space-efficient program. This program can be applied in several comparative genomics tasks, such as the identification of synteny blocks between whole genomes.
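The two steps can be illustrated with a minimal Python sketch. It is not the implementation described in the article: the suffix tree search is replaced by a naive quadratic scan, the rarity test is a plain occurrence count, and all names (`pairwise_mems`, `count_occurrences`, `rare_multi_mems`, `min_len`) are hypothetical. The sketch assumes at least two sequences and `min_len >= 1`, and it only shows how pairwise matches against a reference can be combined by intersecting their intervals on the reference.

```python
from itertools import product


def pairwise_mems(ref, other, min_len):
    """Return maximal exact matches (ref_pos, other_pos, length) of length >= min_len.

    Assumption: a quadratic scan stands in for the suffix tree search.
    """
    mems = []
    for i in range(len(ref)):
        for j in range(len(other)):
            # Skip start pairs that can still be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == other[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(ref) and j + length < len(other)
                   and ref[i + length] == other[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def rare_multi_mems(seqs, thresholds, min_len):
    """Combine pairwise matches against seqs[0] into rare multiple exact matches.

    Returns a sorted list of (length, (pos_in_S1, ..., pos_in_Sk)).
    """
    ref = seqs[0]
    per_seq = [pairwise_mems(ref, s, min_len) for s in seqs[1:]]
    results = []
    for combo in product(*per_seq):
        # Intersect the matched intervals on the reference sequence.
        start = max(r for r, _, _ in combo)
        end = min(r + length for r, _, length in combo)
        if end - start < min_len:
            continue
        word = ref[start:end]
        # Rarity: the string may occur at most t(i) times in S(i).
        if any(count_occurrences(s, word) > t
               for s, t in zip(seqs, thresholds)):
            continue
        positions = (start,) + tuple(o + (start - r) for r, o, _ in combo)
        results.append((end - start, positions))
    return sorted(results)


if __name__ == "__main__":
    seqs = ["ACGTTGCA", "TTACGTAG", "GGTCGTCC"]
    print(rare_multi_mems(seqs, thresholds=(1, 1, 1), min_len=3))
    # prints [(3, (1, 3, 3))]
```

In the example, the reference shares "ACGT" with the second sequence and only "CGT" with the third, so the intersection on the reference yields the multiple match "CGT". The intersection is maximal because the pairwise match that fixes each boundary cannot be extended past it.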
