TMO: time and memory optimized algorithm applicable for more accurate alignment of trinucleotide repeat disorders associated genes

ABSTRACT In this study, a time and memory optimized (TMO) algorithm is presented. Compared with the Smith–Waterman algorithm, TMO detects continuous insertions/deletions (indels) more accurately in gene fragments associated with disorders caused by over-repetition of a certain codon. The improvement comes from the tendency to pinpoint indels in the least preserved nucleotide pairs. Nucleotide pairs that occur less frequently are classified as less preserved and are treated as mutated codons whose middle nucleotides were deleted. Another benefit of the proposed algorithm is its general tendency to maximize the number of matching nucleotides included per alignment, regardless of any specific alignment metric. With Smith–Waterman, the structure of the solution depends on how the alignment parameters are set, so an incomplete (shortened) solution may be derived; our algorithm, in contrast, does not reject any of the consistent matching nucleotides that can be included in the final solution. In computational terms, our algorithm runs faster than Smith–Waterman for very similar DNA and requires less memory than the most memory-efficient dynamic programming algorithms. The speedup comes from the reduced number of nucleotide comparisons that have to be performed, without compromising the completeness of the solution. Because only four integers (16 bytes) are required to track a matching fragment, regardless of its length, our algorithm requires less memory than Huang's algorithm.
