Paper information - Machine translation based data augmentation for Cantonese keyword spotting

This paper presents a method to improve a language model for a low-resource language by using statistical machine translation from a related language to generate data for the target language. The machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, for which resources are limited. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. With this method, the keyword search system detects 1.5 times more out-of-vocabulary words and achieves a 1.7% absolute improvement in actual term-weighted value (ATWV).
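The abstract reports its gain in actual term-weighted value but does not spell the metric out. The sketch below shows how ATWV is commonly computed, assuming the standard NIST spoken term detection definition: per-keyword miss and false-alarm probabilities, weighted by β = 999.9 and averaged over keywords that occur in the reference. The names `KeywordCounts` and `atwv`, and the counts in the usage example, are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption: beta = 999.9, the weight used in the standard NIST STD/KWS setup.
BETA = 999.9


@dataclass
class KeywordCounts:
    """Detection counts for one keyword at the system's actual threshold."""
    n_true: int         # occurrences of the keyword in the reference
    n_correct: int      # correct detections
    n_false_alarm: int  # spurious detections


def atwv(counts: Iterable[KeywordCounts], speech_seconds: float,
         beta: float = BETA) -> float:
    """Return 1 - mean(P_miss + beta * P_FA) over keywords with n_true > 0."""
    # Keywords that never occur in the reference have no defined P_miss,
    # so they are left out of the average.
    scored = [c for c in counts if c.n_true > 0]
    if not scored:
        raise ValueError("no keyword occurs in the reference")
    total = 0.0
    for c in scored:
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials are approximated as one per second of speech,
        # minus the true occurrences of this keyword.
        p_fa = c.n_false_alarm / (speech_seconds - c.n_true)
        total += p_miss + beta * p_fa
    return 1.0 - total / len(scored)


if __name__ == "__main__":
    # Made-up counts for two keywords over one hour of speech.
    example = [KeywordCounts(10, 5, 0), KeywordCounts(4, 4, 0)]
    print(atwv(example, speech_seconds=3600.0))
```

Because every keyword carries equal weight regardless of how often it occurs, recovering even a few occurrences of a rare or previously out-of-vocabulary keyword can move the score noticeably, which fits the abstract pairing the out-of-vocabulary result with the ATWV gain.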
