Converting English text to speech: a machine learning approach

The task of mapping spelled English words into strings of phonemes and stresses ("reading aloud") has many practical applications. Several commercial systems perform this task by applying a knowledge base of expert-supplied letter-to-sound rules. This dissertation presents a set of machine learning methods for automatically constructing letter-to-sound rules by analyzing a dictionary of words and their pronunciations. Taken together, these methods provide a substantial performance improvement over the best commercial system, DECtalk from Digital Equipment Corporation.

In a performance test, the learning methods were trained on a dictionary of 19,002 words. Human subjects were then asked to compare the performance of the resulting letter-to-sound rules against the dictionary for an additional 1,000 words not used during training. In a blind procedure, the subjects rated the pronunciations of both the learned rules and the DECtalk rules according to whether they were noticeably different from the dictionary pronunciation. The error rate for the learned rules was 28.8% (288 words noticeably different), while the error rate for the DECtalk rules was 43.3% (433 words noticeably different). If, instead of using human judges, we required that the pronunciations produced by the letter-to-sound rules exactly match the dictionary to be counted correct, then the error rate for the learned rules was 35.2% and the error rate for DECtalk was 63.6%. Similar results were observed at the level of individual letters, phonemes, and stresses.

To achieve these results, several techniques were combined. The key learning technique represents the output classes by the codewords of an error-correcting code. Boolean concept learning methods, such as the standard ID3 decision-tree algorithm, can then be applied to learn the individual bits of these codewords. This converts the multiclass learning problem into a number of Boolean concept learning problems.
This method is shown to be superior to several other methods: multiclass ID3, one-tree-per-class ID3, the domain-specific distributed code employed by T. Sejnowski and C. Rosenberg in their NETtalk system, and a method developed by D. Wolpert. Similar results in the domain of isolated-letter speech recognition with the backpropagation algorithm show that error-correcting output codes provide a domain-independent, algorithm-independent approach to multiclass learning problems.
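The error-correcting output coding (ECOC) scheme described above can be sketched in a few lines. The following is a minimal illustration, not the dissertation's implementation: it assigns each of four classes a 7-bit codeword from the exhaustive code, trains one binary decision tree per bit (scikit-learn's CART standing in for ID3), and decodes a test example to the class with the nearest codeword in Hamming distance. The toy four-cluster dataset and the specific code matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Exhaustive 7-bit code for 4 classes: each column induces a distinct
# binary partition of the classes; minimum pairwise Hamming distance is 4,
# so one erroneous bit learner can still be out-voted at decoding time.
CODE = np.array([
    [1, 1, 1, 1, 1, 1, 1],   # class 0
    [0, 0, 0, 0, 1, 1, 1],   # class 1
    [0, 0, 1, 1, 0, 0, 1],   # class 2
    [0, 1, 0, 1, 0, 1, 0],   # class 3
])

def fit_ecoc(X, y, code=CODE):
    """Train one binary tree per codeword bit position."""
    learners = []
    for bit in range(code.shape[1]):
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X, code[y, bit])   # relabel: is this bit set for the example's class?
        learners.append(tree)
    return learners

def predict_ecoc(learners, X, code=CODE):
    """Concatenate the bit predictions, then decode to the nearest codeword."""
    bits = np.column_stack([tree.predict(X) for tree in learners])
    dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# Toy 2-D problem: four well-separated clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 5], [5, 0], [5, 5]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat(np.arange(4), 20)

learners = fit_ecoc(X, y)
acc = (predict_ecoc(learners, X) == y).mean()
```

Decoding by Hamming distance is what gives the method its robustness: individual bit learners may err, yet the example is still assigned the correct class as long as fewer bits fail than half the code's minimum distance.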

[2]  D. Klatt Synthesis by rule of consonant‐vowel syllables , 1978 .

[3]  John Gaschnig,et al.  Model Design in the Prospector Consultant System for Mineral Exploration , 1981 .

[4]  David W. Shipman,et al.  Letter‐to‐phoneme rules: A semi‐automatic discovery procedure , 1982 .

[5]  Shu Lin,et al.  Error control coding : fundamentals and applications , 1983 .

[6]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[7]  J. Ross Quinlan,et al.  Learning Efficient Classification Procedures and Their Application to Chess End Games , 1983 .

[8]  John M. Lucassen  Discovering phonemic base forms automatically : an information theoretic approach , 1983 .

[9]  Robert L. Mercer,et al.  An information theoretic approach to the automatic determination of phonemic baseforms , 1984, ICASSP.

[10]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, CACM.

[11]  Leslie G. Valiant,et al.  Learning Disjunction of Conjunctions , 1985, IJCAI.

[12]  Kenneth Ward Church Stress assignment in letter‐to‐sound rules for speech synthesis , 1985 .

[13]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[14]  D H Klatt,et al.  Review of text-to-speech conversion for English. , 1987, The Journal of the Acoustical Society of America.

[15]  Terrence J. Sejnowski,et al.  Parallel Networks that Learn to Pronounce English Text , 1987, Complex Syst..

[16]  Paul E. Utgoff,et al.  Perceptron Trees : A Case Study in Hybrid Concept Representations , 1988 .

[17]  Marvin Minsky,et al.  Perceptrons: expanded edition , 1988 .

[18]  Terrence J. Sejnowski,et al.  A Parallel Network that Learns to Play Backgammon , 1989, Artif. Intell..

[19]  Raymond J. Mooney,et al.  An Experimental Comparison of Symbolic and Connectionist Learning Algorithms , 1989, IJCAI.

[20]  Hermann Hild,et al.  Variations on ID3 for text-to-speech conversion , 1989 .

[21]  Geoffrey E. Hinton Connectionist Learning Procedures , 1989, Artif. Intell..

[22]  Thomas G. Dietterich Limitations on Inductive Learning , 1989, ML.

[23]  David Haussler,et al.  Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[24]  David H. Wolpert,et al.  Constructing a generalizer superior to NETtalk via a mathematical theory of generalization , 1990, Neural Networks.

[25]  David H. Wolpert,et al.  A Mathematical Theory of Generalization: Part I , 1990, Complex Syst..

[26]  Thomas G. Dietterich,et al.  A Comparative Study of ID3 and Backpropagation for English Text-to-Speech Mapping , 1990, ML.

[27]  Ron Cole,et al.  The ISOLET spoken letter database , 1990 .

[28]  David H. Wolpert,et al.  A Mathematical Theory of Generalization: Part II , 1990, Complex Syst..

[29]  Geoffrey E. Hinton,et al.  A time-delay neural network architecture for isolated word recognition , 1990, Neural Networks.

[30]  Wray L. Buntine,et al.  A theory of learning classification rules , 1990 .