Aligning Text and Phonemes for Speech Technology Applications Using an EM-Like Algorithm

A common requirement in speech technology is to align two different symbolic representations of the same linguistic ‘message’. For instance, we often need to align the letters of words listed in a dictionary with the phonemes that specify their pronunciation. As dictionaries grow ever larger, manual alignment becomes less and less tenable, yet automatic alignment is a hard problem for a language like English. In this paper, we describe the use of a form of the expectation-maximization (EM) algorithm to learn alignments of English text and phonemes, starting from a variety of initializations. We use the British English Example Pronunciation (BEEP) dictionary of almost 200,000 words in this work. The quality of alignment is difficult to determine quantitatively since no ‘gold standard’ correct alignment exists. We therefore evaluate the algorithm indirectly, from the performance of a pronunciation-by-analogy system that uses the aligned dictionary data as a knowledge base for inferring pronunciations. We find excellent performance, the best so far reported in the literature. There is very little dependence on the start point for alignment, indicating that the EM search space is strongly convex. Since the aligned BEEP dictionary is a potentially valuable resource, it is made freely available for research use.
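To make the idea of an EM-like alignment loop concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a "hard" (Viterbi-style) variant: letter-phoneme association counts are initialized naively, each word is aligned by dynamic programming under the current counts, and the counts are then re-estimated from those alignments. The null symbol `-`, the function names, the smoothing constants, and the toy lexicon with its phoneme symbols are all illustrative assumptions, not BEEP data. The sketch also assumes that no word has more phonemes than letters.

```python
import math
from collections import defaultdict

NULL = "-"  # assumed symbol for a letter that maps to no phoneme


def log_prob(counts, totals, letter, phoneme, smooth=0.1, n_symbols=50):
    """Smoothed log score for pairing a letter with a phoneme.
    smooth and n_symbols are arbitrary illustrative constants."""
    return math.log((counts[(letter, phoneme)] + smooth)
                    / (totals[letter] + smooth * n_symbols))


def align(letters, phonemes, counts, totals):
    """Best-scoring alignment by dynamic programming: each letter takes
    either the next phoneme or NULL."""
    n, m = len(letters), len(phonemes)
    if m > n:
        raise ValueError("sketch assumes no more phonemes than letters")
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        letter = letters[i - 1]
        for j in range(0, min(i, m) + 1):
            # Option 1: this letter is silent (aligned with NULL).
            skip = best[i - 1][j] + log_prob(counts, totals, letter, NULL)
            # Option 2: this letter is aligned with phoneme j.
            take = neg
            if j > 0:
                take = best[i - 1][j - 1] + log_prob(
                    counts, totals, letter, phonemes[j - 1])
            if take >= skip:
                best[i][j], back[i][j] = take, "take"
            else:
                best[i][j], back[i][j] = skip, "skip"
    # Trace back from the final cell to recover one symbol per letter.
    out, i, j = [], n, m
    while i > 0:
        if back[i][j] == "take":
            out.append(phonemes[j - 1])
            j -= 1
        else:
            out.append(NULL)
        i -= 1
    return out[::-1]


def train(lexicon, iterations=5):
    """Alternate alignment and re-estimation of association counts."""
    counts, totals = defaultdict(float), defaultdict(float)
    # Naive start point: associate every letter with every phoneme in its
    # word, plus one NULL count per letter.
    for letters, phonemes in lexicon:
        for letter in letters:
            for p in list(phonemes) + [NULL]:
                counts[(letter, p)] += 1.0
                totals[letter] += 1.0
    for _ in range(iterations):
        new_counts, new_totals = defaultdict(float), defaultdict(float)
        for letters, phonemes in lexicon:
            aligned = align(letters, phonemes, counts, totals)
            for letter, p in zip(letters, aligned):
                new_counts[(letter, p)] += 1.0
                new_totals[letter] += 1.0
        counts, totals = new_counts, new_totals
    return counts, totals


# Toy lexicon with made-up phoneme symbols, for illustration only.
lexicon = [
    ("cat", ["k", "a", "t"]),
    ("hat", ["h", "a", "t"]),
    ("that", ["D", "a", "t"]),
    ("thin", ["T", "I", "n"]),
]
counts, totals = train(lexicon)
for letters, phonemes in lexicon:
    print(letters, align(letters, phonemes, counts, totals))
```

Each aligned output has exactly one symbol per letter, and deleting the null symbols recovers the original phoneme string. Swapping the naive start point in `train` for a different initialization is one way to probe, on a small scale, how sensitive such a loop is to its start point. Words with more phonemes than letters would need an extension, for example null letters or compound phonemes, which this sketch does not attempt.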
