Comparative evaluation of letter-to-sound conversion techniques for English text-to-speech synthesis

Dictionary look-up is the primary strategy for deriving pronunciations for input words in a text-to-speech (TTS) system. This strategy is accurate for dictionary words, but it is not complete: it is impossible to list every possible input word exhaustively. The proper treatment of 'unknown' words is currently an unsolved problem in TTS synthesis. There are many competing techniques for letter-to-sound conversion, and the system developer must make a rational selection among them. However, it is unclear how different techniques should be properly compared. In this paper, we report a comparative assessment of four competing methods (letter-to-sound rules, pronunciation by analogy, feedforward neural networks and a k-nearest neighbour method) with respect to their success at automatic phonemisation. The comparison uses standardised scoring methods, a common test lexicon and a common phoneme inventory. The problem of standardising the phoneme set ('harmonisation') is deceptive: it is much harder than it first appears. The principal finding is that, contrary to the weight of opinion expressed in the literature, data-driven techniques outperform knowledge-based methods by a very significant margin.
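The abstract does not spell out the scoring procedure or the configuration of the k-nearest neighbour method. As a minimal sketch only, assuming a lexicon aligned one phoneme symbol per letter (with '-' available for silent letters), a fixed letter-window context and hand-set position weights, the Python below shows a memory-based (k-nearest neighbour) letter-to-phoneme classifier together with two common scores: percentage of words correct and percentage of phoneme symbols correct on held-out words. The class and function names, the padding symbol, the window size, the weights and the toy transcriptions are all hypothetical illustrations, not the configuration used in the paper.

```python
import heapq
from collections import Counter

PAD = "_"  # hypothetical word-boundary symbol used to pad context windows


def windows(word, size):
    """Return one (2*size + 1)-letter window centred on each letter of word."""
    padded = PAD * size + word + PAD * size
    return [padded[i:i + 2 * size + 1] for i in range(len(word))]


class KNNLetterToSound:
    """Minimal k-nearest-neighbour letter-to-phoneme classifier (illustrative only)."""

    def __init__(self, k=1, size=3):
        self.k, self.size = k, size
        # Hand-set position weights: the centre letter counts most. This is a
        # hypothetical stand-in for learned (e.g. information-gain) feature weights.
        self.weights = [1 / (1 + abs(i - size)) for i in range(2 * size + 1)]
        self.memory = []  # stored (window, phoneme) training instances

    def distance(self, a, b):
        """Weighted overlap distance: sum of weights where the letters differ."""
        return sum(w for w, x, y in zip(self.weights, a, b) if x != y)

    def fit(self, lexicon):
        """lexicon maps word -> phoneme string aligned one symbol per letter."""
        for word, phones in lexicon.items():
            if len(word) != len(phones):
                raise ValueError(f"{word!r} is not letter-phoneme aligned")
            self.memory.extend(zip(windows(word, self.size), phones))
        return self

    def predict_letter(self, window):
        # nsmallest behaves like sorted(...)[:k], so ties go to the earliest-stored instance.
        nearest = heapq.nsmallest(self.k, self.memory,
                                  key=lambda inst: self.distance(window, inst[0]))
        # Majority vote over the k nearest stored phonemes.
        return Counter(ph for _, ph in nearest).most_common(1)[0][0]

    def predict(self, word):
        return "".join(self.predict_letter(w) for w in windows(word, self.size))


def score(model, test_lexicon):
    """Return (% words correct, % phoneme symbols correct) on held-out words."""
    words_ok = symbols_ok = symbols_total = 0
    for word, target in test_lexicon.items():
        guess = model.predict(word)
        words_ok += guess == target
        symbols_ok += sum(g == t for g, t in zip(guess, target))
        symbols_total += len(target)
    return 100 * words_ok / len(test_lexicon), 100 * symbols_ok / symbols_total


if __name__ == "__main__":
    # Toy, hypothetical transcriptions: '@' and 'I' stand for two vowel phonemes.
    train = {"cat": "k@t", "bat": "b@t", "cab": "k@b",
             "kit": "kIt", "sit": "sIt", "bit": "bIt"}
    model = KNNLetterToSound(k=1, size=1).fit(train)
    print(model.predict("sat"))  # s@t
    print(score(model, {"sat": "s@t", "kab": "k@b", "bis": "bIs"}))
```

On this toy data the unseen word "bis" comes out as "bIt", so it scores about 66.7% words correct but about 88.9% symbols correct. The gap illustrates why the choice of scoring method has to be standardised before techniques can be compared: a single wrong symbol costs a whole word under word-level scoring but only one symbol under symbol-level scoring.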
