Speech recognition of under-resourced languages using mismatched transcriptions

Mismatched crowdsourcing is a technique for deriving speech transcriptions from crowd workers who are unfamiliar with the language being spoken. It is especially useful for under-resourced languages, for which native transcribers are hard to hire. In this paper, we demonstrate that using mismatched transcriptions for adaptation improves speech recognition performance under limited matched training data conditions. In addition, we show that data augmentation not only improves the performance of the monolingual system but also makes mismatched-transcription adaptation more effective.
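
The abstract does not specify which augmentation scheme is used, so the snippet below is only an illustrative sketch of one common form of audio data augmentation for ASR: three-way speed perturbation of the raw waveform. The function names and perturbation factors are assumptions for illustration, not details taken from the paper.

import numpy as np

def speed_perturb(waveform, factor):
    """Resample the waveform so it plays `factor` times faster.

    Speeding up (or slowing down) the signal shifts both tempo and pitch,
    which is the standard speed-perturbation recipe for ASR augmentation.
    Illustrative only; not necessarily the paper's exact setup.
    """
    old_idx = np.arange(len(waveform))
    new_len = int(round(len(waveform) / factor))
    new_idx = np.linspace(0, len(waveform) - 1, new_len)
    return np.interp(new_idx, old_idx, waveform)

def augment(waveform):
    # Typical 3-way augmentation: keep the original plus 0.9x and 1.1x copies.
    return [speed_perturb(waveform, f) for f in (0.9, 1.0, 1.1)]

In a Kaldi-style recipe, the same effect is usually obtained with sox-based speed perturbation at the data-preparation stage rather than in Python.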
