Neural Inverse Text Normalization

While there have been several contributions exploring state-of-the-art techniques for text normalization, the problem of inverse text normalization (ITN) remains relatively unexplored. The best-known approaches rely on finite state transducer (FST) based models built from manually curated rules and hence do not scale well. We propose an efficient and robust neural solution for ITN that leverages transformer-based sequence-to-sequence (seq2seq) models, using FST-based text normalization to prepare training data. We show that the approach extends easily to other languages without requiring a linguistic expert to curate rules by hand. We then present a hybrid framework that integrates neural ITN with an FST to overcome common recoverable errors in production environments. Our empirical evaluations show that the proposed solution minimizes incorrect perturbations (insertions, deletions, and substitutions) of the ASR output and maintains high quality even on out-of-domain data. A transformer-based model infused with pretraining consistently achieves a lower WER across several datasets and outperforms the baselines on English, Spanish, German, and Italian.
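
The data-preparation idea can be sketched briefly: written-form text is passed through a text normalizer to produce its spoken form, and the resulting (spoken, written) pairs supervise a seq2seq model that learns the inverse mapping. The sketch below is not the paper's implementation; the toy_spoken_form function is a hypothetical stand-in for a full FST-based normalizer and covers only digits and a couple of symbols.

```python
# Minimal sketch of preparing ITN training pairs (not the authors' code).
# A real system would use a full FST-based text normalization grammar; the
# toy normalizer below only spells out digits and a few symbols, purely to
# illustrate how (spoken, written) pairs are derived from written-form text.

import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def toy_spoken_form(written: str) -> str:
    """Hypothetical stand-in for an FST-based text normalizer."""
    # Spell out each digit individually (a real normalizer verbalizes full numbers).
    spoken = re.sub(r"\d", lambda m: ONES[int(m.group())] + " ", written)
    # Verbalize a couple of common symbols.
    spoken = spoken.replace("%", " percent ").replace("$", " dollars ")
    # Drop punctuation and casing, as in typical ASR output.
    spoken = re.sub(r"[^\w\s]", " ", spoken).lower()
    return re.sub(r"\s+", " ", spoken).strip()

written_corpus = [
    "Meet me at 3 pm on May 5.",
    "The upgrade costs 20 dollars.",
]

# Each pair trains a seq2seq model to map spoken-form ASR output back to written form.
training_pairs = [(toy_spoken_form(s), s) for s in written_corpus]
for spoken, written in training_pairs:
    print(f"{spoken!r} -> {written!r}")
```

At training time each spoken-form string serves as the source sequence and the original written text as the target; applying the same pipeline to large written corpora yields parallel data for any language with a text normalizer, which is what makes the approach scalable across languages.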
