Active learning based data selection for limited resource STT and KWS

This paper presents first results on using active learning (AL) for training data selection in the context of the IARPA Babel program. Given an initial training data set, we aim to automatically select additional data from an untranscribed pool for manual transcription. The initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built with the same amount of data using a pre-defined data selection, the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate and Actual Term Weighted Value for the Lithuanian language. A description of the methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA Babel languages and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.

Index Terms: active learning, low-resourced STT, KWS
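
As a rough illustration of one common AL selection criterion, confidence-based uncertainty sampling, the sketch below picks utterances from an untranscribed pool for which a seed recognizer is least confident, up to a fixed duration budget. The `Utterance` interface, per-word confidence scores, and budget logic are assumptions for illustration; the abstract does not specify the paper's actual selection criteria, which may differ.

```python
# Minimal sketch of confidence-based active learning data selection
# (uncertainty sampling). Assumes each pool utterance carries a duration
# and per-word posterior confidences produced by a seed recognizer.
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    duration: float                 # seconds of audio
    word_confidences: List[float]   # posterior confidence per hypothesized word


def avg_confidence(utt: Utterance) -> float:
    """Mean word posterior; low values mean the recognizer is unsure."""
    if not utt.word_confidences:
        return 0.0
    return sum(utt.word_confidences) / len(utt.word_confidences)


def select_for_transcription(pool: List[Utterance],
                             budget_seconds: float) -> List[Utterance]:
    """Pick the least-confident utterances until the duration budget is spent."""
    selected: List[Utterance] = []
    spent = 0.0
    for utt in sorted(pool, key=avg_confidence):  # most uncertain first
        if spent + utt.duration > budget_seconds:
            continue  # skip utterances that would exceed the budget
        selected.append(utt)
        spent += utt.duration
    return selected
```

The selected utterances would then be sent for manual transcription and pooled with the initial training set to retrain the acoustic and language models, as described in the abstract.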
