Sequence alignment with arbitrary steps and further generalizations, with applications to alignments in linguistics

We provide simple generalizations of the classical Needleman-Wunsch algorithm for aligning two sequences. First, we let both sequences be defined over arbitrary, potentially different alphabets. Second, we consider similarity functions between elements of both sequences with ranges in a semiring. Third, instead of considering only 'match', 'mismatch' and 'skip' operations, we allow arbitrary non-negative alignment 'steps' S. Next, we present novel combinatorial formulas for the number of monotone alignments between two sequences for selected steps S. Finally, we illustrate sample applications in natural language processing that require larger steps than those available in the original Needleman-Wunsch sequence alignment procedure, and to which our generalizations can therefore be fruitfully applied.
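The three generalizations can be illustrated with a small dynamic program. The following is a minimal sketch, not the paper's own implementation: the function name `align`, its parameters, and the representation of S as a set of pairs (a, b), meaning "consume a elements of the first sequence and b of the second", are assumptions made for illustration. The semiring is passed in as `plus`, `times`, `zero` and `one`, and the similarity function `sim` scores a pair of subsequences.

```python
import operator


def align(x, y, steps, sim, plus, times, zero, one):
    """Combine the scores of all monotone alignments of x and y in a semiring.

    steps: iterable of pairs (a, b) of non-negative integers, excluding (0, 0).
    sim:   maps a chunk of x and a chunk of y to a semiring value.
    """
    n, m = len(x), len(y)
    # table[i][j] holds the semiring value for the prefixes x[:i] and y[:j].
    table = [[zero] * (m + 1) for _ in range(n + 1)]
    table[0][0] = one
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            acc = zero
            for a, b in steps:
                if (a, b) != (0, 0) and a <= i and b <= j:
                    score = sim(x[i - a:i], y[j - b:j])
                    acc = plus(acc, times(table[i - a][j - b], score))
            table[i][j] = acc
    return table[n][m]


# Classical Needleman-Wunsch steps: match/mismatch and the two skips.
CLASSICAL = {(1, 1), (1, 0), (0, 1)}


def count_alignments(n, m, steps):
    """Counting semiring (+, *): every alignment contributes the value 1."""
    return align([None] * n, [None] * m, steps,
                 sim=lambda u, v: 1,
                 plus=operator.add, times=operator.mul, zero=0, one=1)


def best_score(x, y, steps):
    """Max-plus semiring with a toy similarity: +1 for equal chunks, else -1."""
    return align(x, y, steps,
                 sim=lambda u, v: 1.0 if u == v else -1.0,
                 plus=max, times=operator.add, zero=float("-inf"), one=0.0)


print(count_alignments(2, 2, CLASSICAL))                 # 13
print(count_alignments(3, 3, {(1, 1), (1, 2), (2, 1)}))  # 3
print(best_score("ab", "ab", CLASSICAL))                 # 2.0
```

Swapping the semiring changes what the same recursion computes: the max-plus instance returns the score of a best alignment, while the counting instance returns the number of monotone alignments for the chosen steps S. Larger steps such as (1, 2) or (2, 1) allow one element to be aligned with two, the kind of many-to-many match the abstract has in mind for linguistic applications.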
