Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-Trained BERT

Grapheme-to-phoneme (G2P) conversion is an essential component of Mandarin Chinese text-to-speech (TTS) systems, and polyphone disambiguation is its core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character; it accepts the sentence containing the polyphonic character as input in the form of a raw Chinese character sequence, without any preprocessing. The proposed method consists of a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a neural-network (NN) based classifier. The pre-trained BERT model extracts semantic features from the raw Chinese character sequence, and the NN-based classifier predicts the polyphonic character's pronunciation from the BERT output. To explore the impact of contextual information on polyphone disambiguation, three different classifiers are investigated: a fully connected network based classifier, a long short-term memory (LSTM) network based classifier, and a Transformer block based classifier. Experimental results demonstrate the effectiveness of the proposed end-to-end framework for polyphone disambiguation and show that the semantic features extracted by BERT greatly enhance performance.
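
To make the described architecture concrete, the sketch below shows one way such a framework could be assembled with the Hugging Face transformers library: a pre-trained Chinese BERT encoder followed by a simple classifier over the hidden state at the polyphonic character's position. The model name (bert-base-chinese), the fully connected head, and the number of pronunciation classes are illustrative assumptions, not the authors' exact configuration; the LSTM and Transformer-block variants would replace the head while keeping the same encoder.

```python
# Hypothetical sketch of the BERT + classifier framework described in the abstract.
# Model name, head design, and the number of pronunciation classes are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PolyphoneClassifier(nn.Module):
    def __init__(self, num_pronunciations: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # semantic feature extractor
        hidden = self.bert.config.hidden_size              # 768 for bert-base
        # Fully connected variant; an LSTM or Transformer block over all hidden
        # states could replace this head to exploit more contextual information.
        self.head = nn.Linear(hidden, num_pronunciations)

    def forward(self, input_ids, attention_mask, poly_index):
        # poly_index: token position of the polyphonic character in each sequence
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        states = out.last_hidden_state                      # (batch, seq_len, hidden)
        target = states[torch.arange(states.size(0)), poly_index]
        return self.head(target)                            # logits over pronunciations

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = PolyphoneClassifier(num_pronunciations=5)           # assumed class count
enc = tokenizer("他长得很高", return_tensors="pt")            # 长 is the polyphone
logits = model(enc["input_ids"], enc["attention_mask"],
               poly_index=torch.tensor([2]))                 # index 2 after [CLS]
```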
