Converting Written Language to Spoken Language with Neural Machine Translation for Language Modeling

When building a language model (LM) for spontaneous speech, the ideal situation is to have a large amount of spoken, in-domain training data. Such abundant data, however, is rarely available. We address this problem by generating spoken-language texts from written-language texts with a neural machine translation (NMT) model. We collected faithful transcripts of fully spontaneous speech and the corresponding written versions, and used them as a parallel corpus to train the NMT model. For generation we used top-k random sampling, which produces a wider variety of higher-quality texts than other decoding methods for NMT. We show that the NMT model can convert written texts in a given domain into spoken-style texts, and that the converted texts are effective for training LMs. Our experimental results show significant improvements in speech recognition accuracy with these LMs.
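To make the decoding step concrete, below is a minimal sketch of top-k random sampling at a single decoder step. It is not the authors' implementation; the `decoder_step`, `bos_id`, and `eos_id` names in the surrounding loop are hypothetical placeholders for an NMT decoder that returns a vector of vocabulary logits. The idea is simply to truncate the output distribution to the k most probable tokens, renormalize, and sample, which injects variety into the generated spoken-style texts while avoiding low-probability degenerate outputs.

```python
import numpy as np

def sample_top_k(logits, k=10, temperature=1.0, rng=None):
    """Top-k random sampling over one decoder step.

    Keeps only the k highest-scoring vocabulary entries, renormalizes
    them with a softmax, and samples a single token id.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Indices of the k largest logits; all other tokens are discarded.
    top_ids = np.argpartition(logits, -k)[-k:]
    top_logits = logits[top_ids]
    # Softmax restricted to the truncated candidate set.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

def generate(decoder_step, bos_id, eos_id, max_len=128, k=10):
    """Hypothetical decoding loop: the written-language source is assumed
    to be encoded inside `decoder_step`, which maps the partial target
    sequence to logits over the spoken-style target vocabulary."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(tokens)      # shape: (vocab_size,)
        next_id = sample_top_k(logits, k=k)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because each call to `generate` draws a different sample, the same written-language sentence can be converted many times to obtain a diverse corpus of spoken-style variants for LM training.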
