Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners

Grapheme-to-phoneme (G2P) transduction is part of the standard text-to-speech (TTS) pipeline. However, G2P conversion is difficult for languages that contain heteronyms: words that share one spelling but can be pronounced in multiple ways. G2P datasets with annotated heteronyms are limited in size and expensive to create, as human labeling remains the primary method for heteronym disambiguation. We propose a RAD-TTS Aligner-based pipeline to automatically disambiguate heteronyms in datasets that contain both audio and text transcripts. For each heteronym, the pipeline generates all possible pronunciation candidates, scores each candidate against the audio with an Aligner model, and chooses the best-scoring one. The resulting labels can be used to create training datasets for both multi-stage and end-to-end G2P systems.
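The generate-and-score step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `HETERONYMS` dictionary, the function names, and `toy_cost` are all hypothetical. In particular, `toy_cost` is only a stand-in for a real Aligner score. It compares candidates against a list of spoken tokens instead of audio, and it assumes that a lower value means a better match.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

# Hypothetical pronunciation dictionary: heteronym -> candidate phoneme strings.
HETERONYMS: Dict[str, List[str]] = {
    "read": ["R IY D", "R EH D"],
    "lead": ["L IY D", "L EH D"],
}


def generate_candidates(words: Sequence[str]) -> List[List[str]]:
    """Expand a transcript into every combination of heteronym pronunciations.

    Words that are not in the dictionary are passed through unchanged.
    """
    options = [HETERONYMS.get(word.lower(), [word]) for word in words]
    return [list(combo) for combo in product(*options)]


def pick_best(
    words: Sequence[str],
    audio: Sequence[str],
    cost_fn: Callable[[Sequence[str], Sequence[str]], float],
) -> List[str]:
    """Return the candidate with the lowest cost under cost_fn."""
    return min(generate_candidates(words), key=lambda cand: cost_fn(cand, audio))


def toy_cost(candidate: Sequence[str], audio: Sequence[str]) -> float:
    """Stand-in for an Aligner score (assumption): count mismatched positions.

    Here `audio` is a list of the tokens actually spoken, not a waveform.
    """
    return float(sum(1 for c, a in zip(candidate, audio) if c != a))


transcript = ["I", "read", "it"]
spoken = ["I", "R EH D", "it"]  # placeholder for the recorded utterance
best = pick_best(transcript, spoken, toy_cost)
```

In a real pipeline, `cost_fn` would wrap a trained Aligner model that compares each candidate phoneme sequence with the recorded audio. Note that the number of candidates grows multiplicatively with the number of heteronyms in a sentence, so a practical system may need to disambiguate heteronyms one at a time.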
