Learning String-Edit Distance

In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

[1]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[2]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[3]  L. Baum,et al.  An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology , 1967 .

[4]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[5]  Donald B. Rubin,et al.  Max-imum Likelihood from Incomplete Data , 1972 .

[6]  Michael J. Fischer,et al.  The String-to-String Correction Problem , 1974, JACM.

[7]  Lalit R. Bahl,et al.  Design of a linguistic statistical decoder for the recognition of continuous speech , 1975, IEEE Trans. Inf. Theory.

[8]  Lalit R. Bahl,et al.  Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition , 1975, IEEE Trans. Inf. Theory.

[9]  James L. Peterson,et al.  Computer programs for detecting and correcting spelling errors , 1980, CACM.

[10]  Patrick A. V. Hall,et al.  Approximate String Matching , 1994, Encyclopedia of Algorithms.

[11]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[12]  Mike Paterson,et al.  A Faster Algorithm Computing String Edit Distances , 1980, J. Comput. Syst. Sci..

[13]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[14]  R. Redner,et al.  Mixture densities, maximum likelihood, and the EM algorithm , 1984 .

[15]  B. John Oommen Constrained string editing , 1986, Inf. Sci..

[16]  from projections. ” IEEE Trans Nucl Sci, 1976; NS-23: 1428-1432 [2] A.P. Dempster, N.M. Laird, D.B. Rubin “Maximum likelihood from incomplete data via , 1991 .

[17]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[18]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Enrique Vidal,et al.  Computation of Normalized Edit Distance and Applications , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[20]  János Csirik,et al.  Parametric string edit distance and its application to pattern recognition , 1995, IEEE Trans. Syst. Man Cybern..

[21]  Fernando Pereira,et al.  The AT&t 60,000 word speech-to-text system , 1995, EUROSPEECH.

[22]  D. Llorens,et al.  Application of Extended Generalized Linear Discriminant Functions (egldf) to Planar Shape Recognition , 1996 .

[23]  B. John Oommen,et al.  Optimal and Information Theoretic Syntactic Pattern Recognition for Traditional Errors , 1996, SSPR.

[24]  Steven Greenberg,et al.  INSIGHTS INTO SPOKEN LANGUAGE GLEANED FROM PHONETIC TRANSCRIPTION OF THE SWITCHBOARD CORPUS , 1996 .

[25]  Andrej Ljolje,et al.  Automatic Generation of Detailed Pronunciation Lexicons , 1996 .

[26]  Eric Sven Ristad,et al.  Hierarchical Non-Emitting Markov Models , 1997, ACL.

[27]  Peter Nicholas Yianilos,et al.  Topics in computational hidden state modeling , 1997 .

[28]  Kuldip K. Paliwal,et al.  Automatic Speech and Speaker Recognition: Advanced Topics , 1999 .

[29]  Shane S. Sturrock,et al.  Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95). , 2000 .