Very Low Resource Radio Browsing for Agile Developmental and Humanitarian Monitoring

We present a radio browsing system developed on a very small corpus of annotated speech, using semi-supervised training of multilingual DNN/HMM acoustic models. The system is intended to support relief and developmental programmes of the United Nations (UN) in parts of Africa where the spoken languages are severely under-resourced. We assume that only 12 minutes of annotated speech are available in the target language, and show how this data can best be used to develop an acoustic model. First, a multilingual DNN/HMM is trained with Acholi as the target language and Luganda, Ugandan English and South African English as source languages. We show that the lowest word error rates are achieved by using this model to label further untranscribed target-language data and then training an SGMM acoustic model on the extended dataset. The performance of an ASR system trained in this way is sufficient for keyword detection that yields useful, actionable and near real-time information to developmental organisations.
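The semi-supervised step described above (a seed model decodes untranscribed audio, and only confident hypotheses are added to the training pool for the final SGMM model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `decode`, the field names, and the confidence threshold are all hypothetical stand-ins for a real DNN/HMM decoder such as one built with Kaldi.

```python
# Sketch of confidence-filtered self-labeling for semi-supervised training.
# `decode` is a toy placeholder for a real DNN/HMM decoder; in practice the
# hypothesis and confidence would come from a lattice-based ASR system.

def decode(utterance):
    """Toy decoder: returns (hypothesised transcript, confidence in [0, 1])."""
    return utterance["hyp"], utterance["conf"]

def self_label(untranscribed, seed_train, conf_threshold=0.7):
    """Extend the seed training set with confidently decoded utterances."""
    extended = list(seed_train)
    for utt in untranscribed:
        hyp, conf = decode(utt)
        if conf >= conf_threshold:  # keep only confident automatic labels
            extended.append({"audio": utt["audio"], "text": hyp})
    return extended

# Toy usage: a small annotated seed set plus untranscribed audio.
seed = [{"audio": "utt001.wav", "text": "annotated seed transcript"}]
unlabeled = [
    {"audio": "utt002.wav", "hyp": "food distribution report", "conf": 0.91},
    {"audio": "utt003.wav", "hyp": "unreliable hypothesis", "conf": 0.42},
]
train_set = self_label(unlabeled, seed)
# Only the high-confidence utterance joins the seed data, giving an
# extended set on which the final acoustic model would be trained.
```

The key design choice is the confidence threshold: too low and decoding errors pollute the extended training set, too high and little untranscribed data is recovered.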
