Converting Written Language to Spoken Language with Neural Machine Translation for Language Modeling

When building a language model (LM) for spontaneous speech, the ideal situation is to have a large amount of spoken, in-domain training data. Such abundant data, however, is rarely available. We address this problem by generating spoken-language texts from written-language texts with a neural machine translation (NMT) model. We collected faithful transcripts of fully spontaneous speech and the corresponding written versions, and used them as a parallel corpus to train the NMT model. For generation we used top-k random sampling, which produces a wider variety of higher-quality texts than other decoding methods for NMT. We show that the NMT model can convert written texts in a given domain into spoken-style texts, and that the converted texts are effective for training LMs. Our experimental results show significant improvements in speech recognition accuracy with these LMs.
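To make the decoding step concrete, below is a minimal sketch of top-k random sampling at a single decoder step. It is not the authors' implementation; the `decoder_step`, `bos_id`, and `eos_id` names in the surrounding loop are hypothetical placeholders for an NMT decoder that returns a vector of vocabulary logits. The idea is simply to truncate the output distribution to the k most probable tokens, renormalize, and sample, which injects variety into the generated spoken-style texts while avoiding low-probability degenerate outputs.

```python
import numpy as np

def sample_top_k(logits, k=10, temperature=1.0, rng=None):
    """Top-k random sampling over one decoder step.

    Keeps only the k highest-scoring vocabulary entries, renormalizes
    them with a softmax, and samples a single token id.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Indices of the k largest logits; all other tokens are discarded.
    top_ids = np.argpartition(logits, -k)[-k:]
    top_logits = logits[top_ids]
    # Softmax restricted to the truncated candidate set.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

def generate(decoder_step, bos_id, eos_id, max_len=128, k=10):
    """Hypothetical decoding loop: the written-language source is assumed
    to be encoded inside `decoder_step`, which maps the partial target
    sequence to logits over the spoken-style target vocabulary."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(tokens)      # shape: (vocab_size,)
        next_id = sample_top_k(logits, k=k)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because each call to `generate` draws a different sample, the same written-language sentence can be converted many times to obtain a diverse corpus of spoken-style variants for LM training.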
