Text-To-Phoneme Mapping Using Neural Networks

Text-to-phoneme (TTP) mapping, also called grapheme-to-phoneme (GTP) conversion, is the process of transforming written text into its corresponding phonetic transcription. It is a necessary step in any state-of-the-art automatic speech recognition (ASR) or text-to-speech (TTS) system in which the textual information changes dynamically (e.g., new contact entries for name dialing, or new short messages or emails to be read out by a device). The implementation requirements of a text-to-phoneme mapping module differ significantly between the two kinds of system: in ASR, mapping errors are tolerated better (they lead to occasional recognition errors) than in TTS, where their effect is immediately and always audible. ASR systems typically use text-to-phoneme mapping to lower the footprint (by avoiding storage of the lexicon) while maintaining quality. Its role in TTS systems is different. In addition to phonetic information, TTS systems need prosodic information, which text-to-phoneme mapping cannot predict, to produce high-quality speech. Most state-of-the-art TTS systems use an explicit pronunciation lexicon that aims at the widest possible coverage, on the order of 100K words, with high-quality pronunciation information. For this reason, text-to-phoneme mapping is typically used as a fall-back strategy when the system encounters very rare or non-native words, and the quality of a TTS system is indirectly affected by the quality of the grapheme-to-phoneme conversion. Another important issue is the training of the text-to-phoneme mapping module.
Grapheme-to-phoneme conversion is a static problem, so such a system is trained off-line: the correspondence between the written and spoken forms of a language usually remains unchanged over the lifetime of an application. Consequently, the complexity and speed of model training are of secondary importance compared with, e.g., the speed of convergence or the model size. In this thesis, the problem of text-to-phoneme mapping using neural networks is stud-
