Sequence alignment with tandem duplication

Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events, which assumes that sequences are modified through the operations of substitution, insertion, and deletion (the latter two collectively termed indels). While this model has worked fairly well, it has long been apparent that other mutational events occur. In this paper, we introduce a new model, the DSI model, which includes another common mutational event, tandem duplication. Tandem duplication produces tandem repeats, which are common in DNA and make up perhaps 10% of the human genome. Tandem repeats are responsible for some human diseases and may serve a multitude of functions in DNA regulation and evolution. Using the DSI model, we develop new exact and heuristic algorithms for comparing and aligning DNA sequences that contain tandem repeats.
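
As a rough illustration of the two models, the following sketch applies a single tandem duplication to a short sequence and then measures how a plain SI-model edit distance scores the result. This is not one of the paper's algorithms: the function names `tandem_duplicate` and `si_edit_distance`, the unit costs, and the example sequence are all assumptions made for illustration.

```python
def tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Copy seq[start:start+length] and insert the copy directly after it.

    Hypothetical helper: models one tandem duplication event.
    """
    if length <= 0 or start < 0 or start + length > len(seq):
        raise ValueError("duplicated segment must lie inside the sequence")
    unit = seq[start:start + length]
    return seq[:start + length] + unit + seq[start + length:]


def si_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance using only substitution, insertion, deletion.

    Standard dynamic programming, kept to two rows of the table.
    Unit costs are an assumption; real alignments use tuned scores.
    """
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


ancestor = "ACGTTGCA"
descendant = tandem_duplicate(ancestor, 2, 3)  # duplicates the unit "GTT"
print(descendant)                              # ACGTTGTTGCA
print(si_edit_distance(ancestor, descendant))  # 3
```

Under the SI model, the one duplication event above is charged as three separate insertions. A model that includes tandem duplication as its own operation could instead account for it as a single event, which is the kind of distinction the DSI model is meant to capture.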
