A Unified Transformer-based Framework for Duplex Text Normalization

Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition, respectively. Many methods have been proposed for either TN or ITN, ranging from weighted finite-state transducers to neural networks. Despite their impressive performance, these methods tackle only one of the two tasks, not both. As a result, a complete spoken dialog system must maintain two separate models, one for TN and one for ITN. This heterogeneity increases the technical complexity of the system, which in turn raises the cost of maintenance in a production setting. Motivated by this observation, we propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN. Combined with a simple but effective data augmentation method, our systems achieve state-of-the-art results on the Google TN dataset for English and Russian. They can also reach over 95% sentence-level accuracy on an internal English TN dataset without any additional fine-tuning. In addition, we create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on it. Overall, the experimental results demonstrate that the proposed duplex text normalization framework is highly effective and applicable to a range of domains and languages.
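The duplex idea described above amounts to serving both directions from one seq2seq model by conditioning each input on a task tag. The sketch below is a minimal, hedged illustration of that interface only: the names (`TASK_TN`, `duplex_normalize`) and the tagging convention are assumptions for illustration, and a small lookup table stands in for the trained Transformer.

```python
# Illustrative sketch of a duplex TN/ITN interface. One model serves both
# directions; the direction is selected by a task tag prepended to the input.
# The lookup table below is a toy stand-in for a trained seq2seq Transformer.

TASK_TN = "tn"    # written -> spoken (TTS preprocessing), e.g. "$5" -> "five dollars"
TASK_ITN = "itn"  # spoken -> written (ASR postprocessing), e.g. "five dollars" -> "$5"

_TOY_MODEL = {
    ("tn", "$5"): "five dollars",
    ("itn", "five dollars"): "$5",
}

def duplex_normalize(task: str, text: str) -> str:
    """Run one 'model' in either direction by conditioning on a task tag."""
    if task not in (TASK_TN, TASK_ITN):
        raise ValueError(f"unknown task: {task}")
    # A real system would feed something like f"{task}: {text}" to a single
    # trained Transformer; here we consult the toy table instead.
    return _TOY_MODEL.get((task, text), text)

print(duplex_normalize(TASK_TN, "$5"))             # -> five dollars
print(duplex_normalize(TASK_ITN, "five dollars"))  # -> $5
```

Because the two directions share one set of parameters, deployment needs only a single model artifact, which is the maintenance advantage the abstract argues for.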
