Aligning letters and phonemes for speech synthesis

A common requirement in speech technology is to align two different symbolic representations of the same linguistic ‘message’. For instance, we often need to align the letters of words listed in a dictionary with the corresponding phonemes specifying their pronunciation. As dictionaries become ever bigger, manual alignment becomes less and less tenable, yet automatic alignment is a hard problem for a language like English. In this paper, we describe the use of a form of the expectation-maximization (EM) algorithm to achieve automatic alignment of English text and phonemes. The quality of alignment is assessed by the performance of a pronunciation by analogy system using the aligned dictionary data. We find excellent performance (the best so far reported in the literature of letter-phoneme conversion), independent of the start point for alignment, indicating that the EM search space is strongly convex.
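The abstract does not spell out the exact form of EM used, so the following is only a minimal sketch of one plausible variant: a hard-EM (Viterbi-style) loop in which a dynamic-programming pass finds the best alignment for each word and letter-phoneme association counts are then re-estimated from those alignments. The names `best_alignment`, `em_align` and `NULL`, the co-occurrence initialisation, the additive smoothing, and the restriction that a word has no more phonemes than letters are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict

NULL = "-"  # hypothetical null phoneme assigned to silent letters


def best_alignment(letters, phones, score):
    """Pad `phones` with NULLs to the length of `letters`, maximising
    the summed score(letter, phoneme) by dynamic programming."""
    n, m = len(letters), len(phones)
    if m > n:
        raise ValueError("this sketch assumes no more phonemes than letters")
    neg = float("-inf")
    # best[i][j]: best score aligning the first i letters with the first j phonemes
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(0, min(i, m) + 1):
            # Option 1: letter i is silent and takes the null phoneme.
            if best[i - 1][j] > neg:
                s = best[i - 1][j] + score(letters[i - 1], NULL)
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, "null"
            # Option 2: letter i is paired with phoneme j.
            if j > 0 and best[i - 1][j - 1] > neg:
                s = best[i - 1][j - 1] + score(letters[i - 1], phones[j - 1])
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, "phone"
    # Trace back from the full alignment to recover the padded phoneme string.
    aligned, j = [], m
    for i in range(n, 0, -1):
        if back[i][j] == "phone":
            aligned.append(phones[j - 1])
            j -= 1
        else:
            aligned.append(NULL)
    return aligned[::-1]


def make_score(counts, smoothing):
    """Log of a smoothed relative frequency of phoneme given letter."""
    totals = defaultdict(float)
    for (letter, _phone), c in counts.items():
        totals[letter] += c

    def score(letter, phone):
        c = counts.get((letter, phone), 0.0)
        return math.log((c + smoothing) / (totals[letter] + smoothing))

    return score


def em_align(lexicon, iterations=10, smoothing=0.1):
    """Hard-EM loop over (word, phoneme list) pairs; returns aligned phoneme lists."""
    # Assumed start point: every letter co-occurs with every phoneme of its word.
    counts = defaultdict(float)
    for word, phones in lexicon:
        for letter in word:
            for phone in list(phones) + [NULL]:
                counts[(letter, phone)] += 1.0
    alignments = None
    for _ in range(iterations):
        score = make_score(counts, smoothing)
        # "E-step" (hard): best alignment per word under current associations.
        new_alignments = [best_alignment(w, ph, score) for w, ph in lexicon]
        if new_alignments == alignments:
            break  # alignments stopped changing
        alignments = new_alignments
        # "M-step": re-estimate association counts from the alignments.
        counts = defaultdict(float)
        for (word, _phones), aligned in zip(lexicon, alignments):
            for letter, phone in zip(word, aligned):
                counts[(letter, phone)] += 1.0
    return alignments
```

A call such as `em_align([("cat", ["k", "a", "t"]), ("that", ["D", "a", "t"])])` returns one phoneme list per word, padded with `NULL` so that each list is as long as the word's spelling. Trying a different initialisation of `counts` is one way to probe the start-point independence that the abstract reports.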
