Multi-Task Learning Using Mismatched Transcription for Under-Resourced Speech Recognition

It is challenging to obtain large amounts of native (matched) labels for audio in under-resourced languages, whether because the language has few literate speakers or because it lacks a universally acknowledged orthography. One solution is to increase the amount of labeled data through mismatched transcription: transcribers who do not speak the target language (in place of native speakers) write down what they hear as nonsense speech in their own language (e.g., Mandarin). This paper presents a multi-task learning framework in which a DNN acoustic model is trained simultaneously on a limited amount of native (matched) transcription and a larger set of mismatched transcription. We find that this multi-task framework outperforms both monolingual baselines and previously proposed mismatched-transcription adaptation techniques. In addition, we show that using alignments produced by a GMM adapted with mismatched transcription further improves acoustic modeling performance. Experiments on Georgian data from the IARPA Babel program demonstrate the effectiveness of the proposed method.
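The core idea can be pictured as a shared-hidden-layer network with one task-specific output head per transcription source. Below is a minimal PyTorch-style sketch under assumptions not stated in the abstract: the layer sizes, sigmoid nonlinearity, target inventory sizes, and interpolation weight `lam` are illustrative placeholders, and the paper's actual architecture and loss weighting may differ.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared-hidden-layer DNN with two softmax heads: one predicting
    senones from the limited native (matched) transcription, the other
    predicting targets derived from the mismatched (e.g., Mandarin)
    transcription. All sizes here are hypothetical."""

    def __init__(self, feat_dim=440, hidden_dim=1024, n_hidden=5,
                 n_native_senones=3000, n_mismatched_targets=2000):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
            in_dim = hidden_dim
        self.shared = nn.Sequential(*layers)        # layers shared by both tasks
        self.native_head = nn.Linear(hidden_dim, n_native_senones)
        self.mismatched_head = nn.Linear(hidden_dim, n_mismatched_targets)

    def forward(self, x):
        h = self.shared(x)
        return self.native_head(h), self.mismatched_head(h)

# Interpolated cross-entropy over the two tasks; the weight `lam` is a
# hypothetical tuning knob, not a value reported in the paper.
def multitask_loss(native_logits, mismatched_logits,
                   native_targets, mismatched_targets, lam=0.7):
    ce = nn.functional.cross_entropy
    return (lam * ce(native_logits, native_targets)
            + (1.0 - lam) * ce(mismatched_logits, mismatched_targets))
```

In practice the matched and mismatched sources label different utterances, so a training step would typically compute the loss only for the head whose labels are present in the minibatch; either way, gradients always flow through the shared layers, which is where the transfer between the two transcription sources happens.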
