Stochastic phonographic transduction for English

Abstract: This paper introduces and reviews stochastic phonographic transduction (SPT), a trainable (“data-driven”) technique for letter-to-phoneme conversion based on formal language theory, and describes in detail one particularly simple realization of SPT. The spellings and pronunciations of English words are modelled as the productions of a stochastic grammar, inferred from example data in the form of a pronouncing dictionary. The terminal symbols of the grammar are letter–phoneme correspondences, and the rewrite (production) rules of the grammar specify how these are combined to form acceptable English word spellings and their pronunciations. Given the spelling of a word as input, a pronunciation is produced as output by parsing the input string according to the letter-part of the terminals and selecting the “best” sequence of corresponding phoneme-parts according to some well-motivated criterion. Although the formalism is in principle very general, restrictive assumptions must be made if practical, trainable systems are to be realized. We have assumed at this stage that the grammar is regular. Further, word generation is modelled as a Markov process in which terminals (correspondences) are simply concatenated. The SPT learning task then amounts to inferring a set of correspondences and estimating their associated transition probabilities from the training data. Transduction, which produces a pronunciation for a word given its spelling, is achieved by Viterbi decoding using a maximum likelihood criterion. Results are presented for letter–phoneme alignment and transduction for the dictionary training data, unseen dictionary words, unseen proper nouns and novel (pseudo-)words. Two different ways of inferring correspondences are described and compared. It is found that providing quite limited information about the alternating vowel/consonant structure of words aids the inference process significantly.
The best transduction performance obtained on unseen dictionary words is 93·7% phonemes correct, conservatively scored. Automatically inferred correspondences also consistently outperform a published set of manually derived correspondences when used for SPT. Although the comparison is difficult to make, we believe that the current results for letter-to-phoneme conversion are at least as good as the best reported so far for a data-driven approach, while being comparable in performance to knowledge-based approaches.
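
The Markov-chain view of word generation and the Viterbi decoding step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the five hand-aligned toy words, the phoneme symbols, and the names `train_bigrams`, `transduce`, `START` and `END` are all assumptions made for illustration. The correspondences are supplied by hand rather than inferred, transition probabilities are plain relative frequencies with no smoothing, and every correspondence is assumed to have a non-empty letter part.

```python
import math
from collections import defaultdict

# Hypothetical boundary markers; a correspondence is a (letters, phonemes) pair.
START, END = ("<s>", ""), ("</s>", "")

def train_bigrams(aligned_words):
    """Estimate P(next correspondence | previous) by relative frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in aligned_words:
        path = [START, *word, END]
        for prev, cur in zip(path, path[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
            for prev, nxt in counts.items()}

def transduce(spelling, probs):
    """Viterbi search for the most likely correspondence sequence whose
    letter parts concatenate to `spelling`; returns None if none exists."""
    n = len(spelling)
    # best[pos][corr] = (log probability, (previous pos, previous corr))
    best = [dict() for _ in range(n + 1)]
    best[0][START] = (0.0, None)
    for pos in range(n):
        for prev, (score, _) in best[pos].items():
            for cur, p in probs.get(prev, {}).items():
                letters = cur[0]
                if cur == END or not letters or not spelling.startswith(letters, pos):
                    continue
                new_pos = pos + len(letters)
                cand = score + math.log(p)
                if cur not in best[new_pos] or cand > best[new_pos][cur][0]:
                    best[new_pos][cur] = (cand, (pos, prev))
    # Close the path with the transition into the END marker.
    final = None
    for prev, (score, _) in best[n].items():
        p = probs.get(prev, {}).get(END)
        if p and (final is None or score + math.log(p) > final[0]):
            final = (score + math.log(p), prev)
    if final is None:
        return None
    # Follow back-pointers from the end of the word to START.
    path, pos, corr = [], n, final[1]
    while corr != START:
        path.append(corr)
        pos, corr = best[pos][corr][1]
    path.reverse()
    return path

# Toy hand-aligned training data (illustrative only, not from the paper).
aligned = [
    [("c", "k"), ("a", "a"), ("t", "t")],               # cat
    [("b", "b"), ("a", "a"), ("t", "t")],               # bat
    [("t", "t"), ("a", "a"), ("b", "b")],               # tab
    [("c", "s"), ("i", "i"), ("t", "t"), ("y", "i")],   # city
    [("ch", "tS"), ("a", "a"), ("t", "t")],             # chat
]
probs = train_bigrams(aligned)

result = transduce("cab", probs)   # "cab" is not in the training data
print(" ".join(ph for _, ph in result))
```

In this sketch the unseen spelling "cab" is decoded from transitions observed in "cat" and "tab". A spelling that needs a transition never seen in training returns `None`, which is why a practical system would need some form of probability smoothing.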
