Multimodal, Multilingual Grapheme-to-Phoneme Conversion for Low-Resource Languages

Grapheme-to-phoneme conversion (g2p) is the task of predicting the pronunciation of words from their orthographic representation. Historically, g2p systems were transition- or rule-based, making generalization beyond a monolingual (high resource) domain impractical. Recently, neural architectures have enabled multilingual systems to generalize widely; however, all systems to date have been trained only on spelling-pronunciation pairs. We hypothesize that the sequences of IPA characters used to represent pronunciation do not capture its full nuance, especially when cleaned to facilitate machine learning. We leverage audio data as an auxiliary modality in a multi-task training process to learn a more optimal intermediate representation of source graphemes; this is the first multimodal model proposed for multilingual g2p. Our approach is highly effective: on our in-domain test set, our multimodal model reduces phoneme error rate to 2.46%, a more than 65% decrease compared to our implementation of a unimodal spelling-pronunciation model—which itself achieves state-of-the-art results on the Wiktionary test set. The advantages of the multimodal model generalize to wholly unseen languages, reducing phoneme error rate on our out-of-domain test set to 6.39% from the unimodal 8.21%, a more than 20% relative decrease. Furthermore, our training and test sets are composed primarily of low-resource languages, demonstrating that our multimodal approach remains useful when training data are constrained.
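The abstract does not spell out the architecture, but the multi-task idea it describes can be sketched as a shared grapheme encoder trained against two targets: the phoneme sequence (primary task) and acoustic features such as MFCC frames (auxiliary task). The code below is a minimal, hypothetical PyTorch illustration of that training setup; the class name MultimodalG2P, the GRU encoder and decoders, the dimensions, and the loss weight lam are all assumptions for the sake of the sketch, not the authors' implementation (the paper's own models are presumably attention-based sequence-to-sequence networks, which this sketch omits).

```python
import torch
import torch.nn as nn


class MultimodalG2P(nn.Module):
    """Shared grapheme encoder with a phoneme decoder (primary task)
    and an auxiliary decoder over acoustic frames (e.g. MFCCs).
    Illustrative sketch only; not the paper's architecture."""

    def __init__(self, n_graphemes, n_phonemes, n_mfcc=13, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.phoneme_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.phoneme_out = nn.Linear(hidden, n_phonemes)
        self.audio_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.audio_out = nn.Linear(hidden, n_mfcc)

    def forward(self, graphemes, n_phoneme_steps, n_audio_steps):
        _, state = self.encoder(self.embed(graphemes))   # state: (1, B, H)
        # Feed the final encoder state at every decoder step; this is a
        # simplification of the attention-based decoding the paper implies.
        ctx = state.transpose(0, 1)                      # (B, 1, H)
        ph_in = ctx.repeat(1, n_phoneme_steps, 1)
        au_in = ctx.repeat(1, n_audio_steps, 1)
        ph_hidden, _ = self.phoneme_decoder(ph_in, state)
        au_hidden, _ = self.audio_decoder(au_in, state)
        return self.phoneme_out(ph_hidden), self.audio_out(au_hidden)


def multitask_loss(ph_logits, ph_targets, mfcc_pred, mfcc_target, lam=0.5):
    """Cross-entropy on phonemes plus a weighted MSE term on audio frames;
    the weight lam is an illustrative hyperparameter, not from the paper."""
    ce = nn.functional.cross_entropy(
        ph_logits.reshape(-1, ph_logits.size(-1)), ph_targets.reshape(-1))
    mse = nn.functional.mse_loss(mfcc_pred, mfcc_target)
    return ce + lam * mse


# Toy usage with random data, just to show the two training signals.
model = MultimodalG2P(n_graphemes=60, n_phonemes=100)
graphemes = torch.randint(0, 60, (8, 12))        # batch of 8 spellings
ph_targets = torch.randint(0, 100, (8, 10))      # gold phoneme sequences
mfcc_target = torch.randn(8, 40, 13)             # gold MFCC frames
ph_logits, mfcc_pred = model(graphemes, 10, 40)
loss = multitask_loss(ph_logits, ph_targets, mfcc_pred, mfcc_target)
loss.backward()
```

At test time only the phoneme branch would be used, so the auxiliary audio decoder acts purely as a training-time regularizer that pushes the shared grapheme encoding to reflect acoustic detail.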
