Efficient Acoustic Modeling Method for Unsupervised Speech Recognition using Multi-Task Deep Neural Network

This paper proposes an acoustic modeling method for speech recognition in zero-resource languages under mismatch conditions. For such languages, little or no transcribed speech is available for conventional monolingual acoustic model training. Conventional approaches such as IPA-based universal acoustic modeling have proved effective under matched acoustic conditions (similar speaking styles, closely related languages, etc.) but usually perform poorly when a mismatch occurs. Since mismatches between languages are common, this paper proposes unsupervised acoustic modeling via cross-lingual knowledge sharing: first, initial acoustic models (AMs) for the target zero-resource language are trained with a Multi-Task Deep Neural Network (MDNN), in which speech from other languages mapped to the phoneme set of the target language (mapped data) is trained jointly with the same speech transcribed in each source language (language-specific data); then, automatically transcribed target-language data is used in an iterative process, with various auxiliary tasks, to train new AMs. An experiment on 100 hours of untranscribed Japanese speech achieved a character error rate (CER) of 57.21%, an absolute improvement of 19.32% over the baseline (IPA-based universal acoustic modeling).
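To make the multi-task structure concrete, the following is a minimal sketch of an MDNN with shared hidden layers and one softmax head per task, written in Python with PyTorch for illustration. All sizes (feature dimension, hidden widths, phoneme-set sizes), the two-source-language setup, and every name below are assumptions for the example, not the authors' configuration.

```python
# Illustrative sketch of the multi-task DNN (MDNN) idea: a shared trunk
# of hidden layers with one softmax head per task. Hypothetical sizes;
# not the authors' implementation.
import torch
import torch.nn as nn

class SharedTrunkMDNN(nn.Module):
    """Shared hidden layers with one output head per task:
    - task 0: a head over the target language's phoneme set, trained on
      'mapped data' (source-language speech relabeled with target phonemes);
    - tasks 1..N: per-source-language heads, trained on the same speech
      with its original language-specific transcripts."""

    def __init__(self, feat_dim, hidden_dim, target_phones, source_phone_sets):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, target_phones)]
            + [nn.Linear(hidden_dim, n) for n in source_phone_sets]
        )

    def forward(self, feats, task_id):
        # Every task shares the trunk; only the selected head differs.
        return self.heads[task_id](self.trunk(feats))

# Joint training: each minibatch carries a task id selecting its head,
# so gradients from all tasks update the shared hidden layers.
model = SharedTrunkMDNN(feat_dim=40, hidden_dim=1024,
                        target_phones=42, source_phone_sets=[45, 39])
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(feats, labels, task_id):
    opt.zero_grad()
    loss = loss_fn(model(feats, task_id), labels)
    loss.backward()
    opt.step()
    return loss.item()
```

In the iterative stage described above, the target-phoneme head would decode the untranscribed target-language speech, confidence-filtered hypotheses would serve as new training labels, and the MDNN would be retrained with auxiliary tasks; that self-training loop is omitted from the sketch for brevity.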
