Multilingual Grapheme-To-Phoneme Conversion with Byte Representation

Grapheme-to-phoneme (G2P) models convert a written word into its corresponding pronunciation and are essential components in automatic speech recognition (ASR) and text-to-speech (TTS) systems. Recently, neural encoder-decoder architectures have substantially improved G2P accuracy in both monolingual and multilingual settings. However, most multilingual G2P studies focus on sets of languages that share similar graphemes, such as European languages. Multilingual G2P across different writing systems, e.g. European and East Asian, remains understudied. In this work, we propose a multilingual G2P model that uses a byte-level input representation to accommodate different grapheme systems, along with an attention-based Transformer architecture. We evaluate both character-level and byte-level G2P using data from multiple European and East Asian locales. Models using the byte representation yield 16.2%–50.2% relative word error rate improvements over their character-based counterparts in monolingual and multilingual settings. In addition, byte-level models are 15.0%–20.1% smaller. Our results show that bytes are an efficient input representation for multilingual G2P over languages with large grapheme vocabularies.
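The core idea can be illustrated with a minimal sketch (not code from the paper): character-level tokenization needs one vocabulary entry per distinct grapheme, which grows very large for East Asian scripts, whereas UTF-8 byte-level tokenization maps every script into a fixed vocabulary of at most 256 symbols. The example word below is an illustrative choice, not drawn from the paper's data.

```python
# Character-level vs byte-level input representation for G2P.
# "音楽" (Japanese, "music") is a hypothetical example word.
word = "音楽"

# Character-level tokens: one token per grapheme.
# Vocabulary size scales with the number of distinct characters
# across all training languages (tens of thousands for CJK scripts).
char_tokens = list(word)                      # ['音', '楽'] -> 2 tokens

# Byte-level tokens: each character becomes its UTF-8 byte sequence.
# Vocabulary size is fixed at <= 256 regardless of writing system,
# at the cost of longer input sequences (3 bytes per CJK character here).
byte_tokens = list(word.encode("utf-8"))      # 6 integers, each in 0..255

# Byte sequences are lossless: decoding recovers the original word.
recovered = bytes(byte_tokens).decode("utf-8")
```

The trade-off is sequence length versus vocabulary size: byte inputs are up to a few times longer, but the shared 256-entry embedding table is what lets one model, and one smaller parameter budget, cover European and East Asian scripts alike.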
