Investigating techniques for low resource conversational speech recognition

In this paper we investigate various techniques for building effective speech-to-text (STT) and keyword search (KWS) systems for low-resource conversational speech. Subword decoding and graphemic mappings were assessed for detecting out-of-vocabulary keywords. To deal with the limited amount of transcribed data, semi-supervised training and data selection methods were investigated. Robust acoustic features produced via data augmentation were evaluated for acoustic modeling. For language modeling, automatically retrieved conversational-like Web data was used, as well as neural network based language models. We report STT improvements with all of the techniques, but interestingly only some of them improve KWS performance. Results are reported for the Swahili language in the context of the 2015 OpenKWS Evaluation.
