Building an ASR System for Mboshi Using a Cross-Language Definition of Acoustic Units Approach

For many languages in the world, not enough (annotated) speech data is available to train an ASR system. Recently, we proposed a cross-language method for training an ASR system using linguistic knowledge and semi-supervised training. Here, we apply this approach to the low-resource language Mboshi. Starting from an ASR system trained on Dutch, Mboshi acoustic units were first created through cross-language initialization of the phoneme vectors in the output layer. Subsequently, this adapted system was retrained using Mboshi self-labels. Two training methods were investigated: retraining only the output layer, and retraining the full deep neural network (DNN). The resulting Mboshi system was analyzed by inspecting per-phoneme accuracies and phoneme confusions, and by visualizing the hidden layers of the DNNs before and after retraining with the self-labels. The two training methods performed fairly similarly, but the fully retrained DNN yielded a better phoneme representation.
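To make the pipeline concrete, the sketch below shows the three steps in TensorFlow/Keras: cross-language initialization of the output layer, generation of self-labels, and the two retraining regimes. This is a minimal illustration under stated assumptions, not the paper's actual implementation: it assumes the Dutch system is a feed-forward Keras DNN ending in a softmax over Dutch phonemes, and the file names (dutch_dnn.h5, mboshi_frames.npy), the toy mboshi_to_dutch mapping, and the training settings are hypothetical stand-ins.

```python
import numpy as np
import tensorflow as tf

# Hypothetical pretrained Dutch acoustic model: a feed-forward DNN over
# acoustic feature frames with a softmax output over Dutch phonemes.
dutch_model = tf.keras.models.load_model("dutch_dnn.h5")

# Step 1 -- cross-language initialization: each Mboshi acoustic unit starts
# from the output-layer phoneme vector (weight column plus bias) of the Dutch
# phoneme it is mapped to on linguistic grounds (toy mapping shown here).
mboshi_to_dutch = {0: 12, 1: 3, 2: 27}
idx = [mboshi_to_dutch[u] for u in sorted(mboshi_to_dutch)]
W_nl, b_nl = dutch_model.layers[-1].get_weights()  # kernel: (hidden, n_dutch)

hidden_out = dutch_model.layers[-2].output
mboshi_out = tf.keras.layers.Dense(len(idx), activation="softmax",
                                   name="mboshi_out")(hidden_out)
mboshi_model = tf.keras.Model(dutch_model.input, mboshi_out)
mboshi_model.get_layer("mboshi_out").set_weights([W_nl[:, idx], b_nl[idx]])

# Step 2 -- self-labels: decode untranscribed Mboshi speech with the adapted
# model and take the per-frame argmax posteriors as training targets.
X = np.load("mboshi_frames.npy")  # hypothetical Mboshi feature frames
self_labels = mboshi_model.predict(X).argmax(axis=-1)

# Step 3a -- training method 1: retrain only the output layer
# (all hidden layers frozen).
for layer in mboshi_model.layers[:-1]:
    layer.trainable = False
mboshi_model.compile("adam", "sparse_categorical_crossentropy")
mboshi_model.fit(X, self_labels, epochs=5)

# Step 3b -- training method 2: retrain the full DNN (all layers trainable).
for layer in mboshi_model.layers:
    layer.trainable = True
mboshi_model.compile("adam", "sparse_categorical_crossentropy")
mboshi_model.fit(X, self_labels, epochs=5)
```

Freezing everything but the output layer corresponds to the first training method; unfreezing all layers before the second fit call corresponds to full retraining, the variant that, per the abstract, produced the better internal phoneme representation.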
