Edit-Distance of Weighted Automata: General Definitions and Algorithms

The problem of computing the similarity between two sequences arises in many areas, such as computational biology and natural language processing. A common measure of the similarity of two strings is their edit-distance, that is, the minimal cost of a series of symbol insertions, deletions, or substitutions transforming one string into the other. In several applications, such as speech recognition or computational biology, the objects to compare are distributions over strings, i.e., sets of strings representing a range of alternative hypotheses with their associated weights or probabilities. We define the edit-distance of two distributions over strings and present algorithms for computing it when these distributions are given by automata. In the particular case where the two sets of strings are given by unweighted automata, their edit-distance can be computed using the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm. In the general case, we show tha...
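To make the unweighted special case concrete, the following Python sketch computes the smallest edit-distance between any string accepted by one automaton and any string accepted by another. It is an illustration under stated assumptions, not the paper's implementation: every insertion, deletion, and substitution is assumed to cost 1, the automata are assumed to have no epsilon-transitions, and the `(start, finals, arcs)` tuple layout and the helper names `linear_automaton` and `automaton_edit_distance` are hypothetical. Rather than building an explicit edit transducer and composing it with the two automata, the sketch runs Dijkstra's algorithm directly over pairs of states, which explores the same product graph that such a composition would produce.

```python
import heapq
from itertools import count


def linear_automaton(s):
    """Hypothetical helper: automaton accepting exactly the string s.

    An automaton is a tuple (start, finals, arcs), where arcs maps a state
    to a list of (symbol, next_state) pairs. This layout is an assumption.
    """
    arcs = {i: [(ch, i + 1)] for i, ch in enumerate(s)}
    return 0, {len(s)}, arcs


def automaton_edit_distance(a1, a2):
    """Minimum unit-cost edit-distance between a string of a1 and a string of a2.

    Returns None if either automaton accepts no string.
    """
    start1, finals1, arcs1 = a1
    start2, finals2, arcs2 = a2
    source = (start1, start2)
    dist = {source: 0}
    tie = count()  # tie-breaker so the heap never compares state pairs
    heap = [(0, next(tie), source)]
    while heap:
        d, _, (p, q) = heapq.heappop(heap)
        if d > dist[(p, q)]:
            continue  # stale heap entry
        if p in finals1 and q in finals2:
            return d  # first accepting pair popped is optimal (costs >= 0)
        moves = []
        for a, p2 in arcs1.get(p, ()):
            moves.append((1, (p2, q)))  # delete symbol a
            for b, q2 in arcs2.get(q, ()):
                # match costs 0, substitution costs 1 (assumed unit costs)
                moves.append((0 if a == b else 1, (p2, q2)))
        for b, q2 in arcs2.get(q, ()):
            moves.append((1, (p, q2)))  # insert symbol b
        for cost, nxt in moves:
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, next(tie), nxt))
    return None


# Automaton accepting the two strings "abc" and "xyz".
two_words = (0, {3, 6}, {0: [("a", 1), ("x", 4)], 1: [("b", 2)], 2: [("c", 3)],
                         4: [("y", 5)], 5: [("z", 6)]})
print(automaton_edit_distance(linear_automaton("kitten"), linear_automaton("sitting")))
print(automaton_edit_distance(two_words, linear_automaton("abd")))
```

Because all edit costs are non-negative, Dijkstra's algorithm is sufficient here. The search visits at most one pair per combination of states from the two automata, so the work grows with the product of their sizes. The sketch does not cover weighted automata.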
