Deep Auto-Encoder Based Multi-Task Learning Using Probabilistic Transcriptions

We examine a scenario in which no native transcribers are available for the target language, as is typical of under-resourced language communities. Turkers (online crowd workers) recruited from online marketplaces can nevertheless serve as a valuable alternative source of transcripts in the target language. We assume that the turkers neither speak nor have any familiarity with the target language. They are therefore unable to distinguish all phone pairs in the target language, and their transcripts specify, at best, a probability distribution called a probabilistic transcript (PT). Standard deep neural network (DNN) training using PTs does not necessarily improve error rates. Previously reported results have demonstrated some success by adopting the multi-task learning (MTL) approach. In this study, we report further improvements by introducing a deep auto-encoder based MTL. This method leverages large amounts of untranscribed data in the target language in addition to the PTs obtained from turkers. Furthermore, to encourage transfer learning in the feature space, we also examine the effect of using monophones from transcripts in well-resourced languages. We report consistent improvements in phone error rate (PER) for Swahili, Amharic, Dinka, and Mandarin.
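The combination of PT supervision and auto-encoder reconstruction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the single shared hidden layer, the parameter names, the `recon_weight` value, and the use of a plain mean-squared reconstruction error are all assumptions chosen for brevity. The sketch only computes a joint loss in which frames with a PT contribute a soft-label cross-entropy term, while every frame, including untranscribed ones, contributes a reconstruction term.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(d, hidden, k, seed=0):
    # Hypothetical parameter layout: one shared layer, two task heads.
    rng = np.random.default_rng(seed)
    return {
        "W_shared": rng.normal(0.0, 0.1, (d, hidden)),
        "b_shared": np.zeros(hidden),
        "W_phone": rng.normal(0.0, 0.1, (hidden, k)),   # phone classifier head
        "b_phone": np.zeros(k),
        "W_dec": rng.normal(0.0, 0.1, (hidden, d)),     # auto-encoder decoder head
        "b_dec": np.zeros(d),
    }

def mtl_loss(x, pt, has_pt, params, recon_weight=0.5):
    """Joint loss for one batch.

    x:      (n, d) acoustic feature frames
    pt:     (n, k) soft phone labels; rows with has_pt=True sum to 1
    has_pt: (n,) boolean mask, False for untranscribed frames
    """
    h = np.tanh(x @ params["W_shared"] + params["b_shared"])
    probs = softmax(h @ params["W_phone"] + params["b_phone"])
    x_hat = h @ params["W_dec"] + params["b_dec"]

    # Cross-entropy against the PT distribution, only where a PT exists.
    ce_per_frame = -(pt * np.log(probs + 1e-12)).sum(axis=1)
    ce = float(ce_per_frame[has_pt].mean()) if has_pt.any() else 0.0

    # Reconstruction error over all frames, transcribed or not.
    mse = float(((x_hat - x) ** 2).mean())
    return ce + recon_weight * mse, ce, mse

# Toy batch: 6 frames, 4-dim features, 3 phone classes, last 3 frames untranscribed.
rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))
pt = rng.dirichlet(np.ones(3), size=6)
has_pt = np.array([True, True, True, False, False, False])
params = init_params(d=4, hidden=8, k=3)
total, ce, mse = mtl_loss(x, pt, has_pt, params)
```

Because both heads read the same hidden representation, gradients from the reconstruction term would shape the shared layer even on frames that have no PT, which is the intuition behind using untranscribed audio in this kind of setup.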
