Map and Relabel: Towards Almost-Zero Resource Speech Recognition

Modern automatic speech recognition (ASR) systems require large amounts of data to train the acoustic model, especially with the state-of-the-art deep neural network (DNN) architecture. Unfortunately, most languages in the world have very limited data resources, which restricts the application of ASR technology to these languages. The state-of-the-art approach to this problem is transfer learning, by which DNNs trained on data from a rich-resource language can be reused by low-resource systems, either as a feature extractor or as an initial model. This approach, however, still requires several hours of transcribed speech, which remains unaffordable for many languages. In this study, we present a novel Map and Relabel (MaR) approach that can train ASR systems for new languages with only a few hundred labelled utterances. The approach combines transfer learning and semi-supervised learning in a boosting manner: it first trains a simple monophone DNN on the limited training data using the popular transfer learning approach (Map phase); this model is then used to produce pseudo phone labels for a large amount of untranscribed speech (Relabel phase). The pseudo-labelled data are then used to train a full-fledged triphone system. Experiments on Uyghur, a major minority language in western China, demonstrate that the MaR approach is rather successful: it can train a reasonably good Uyghur ASR system with only 500 utterances. These encouraging results indicate that it is possible to quickly construct a usable ASR system for a new language, and the only effort required is labelling several hundred utterances.
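The Relabel phase is essentially confidence-filtered pseudo-labelling. As a rough illustration only, the sketch below mimics the two-phase pipeline on synthetic frame-level features, using scikit-learn's MLPClassifier as a stand-in for the DNN acoustic model; the synthetic data, the 0.9 confidence threshold, and all names are illustrative assumptions, not the paper's actual recipe.

```python
# Toy sketch of the Map-and-Relabel (MaR) pipeline. scikit-learn's
# MLPClassifier stands in for the DNN acoustic model; synthetic
# 20-dimensional "frame features" stand in for acoustic features.
# All names, sizes, and the 0.9 threshold are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_frames(n_per_class, n_classes=3, dim=20):
    """Draw Gaussian clusters that play the role of phone classes."""
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, dim))
                   for c in range(n_classes)])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_seed, y_seed = make_frames(50)      # small labelled seed set
X_pool, _ = make_frames(2000)         # large untranscribed pool

# Map phase (stand-in): train a simple seed model on the labelled data.
# In the paper, this seed monophone DNN is built with transfer learning
# from a rich-resource language.
seed_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                           random_state=0).fit(X_seed, y_seed)

# Relabel phase: pseudo-label the pool and keep only confident frames.
proba = seed_model.predict_proba(X_pool)
confident = proba.max(axis=1) > 0.9   # confidence threshold (assumed)
X_aug = np.vstack([X_seed, X_pool[confident]])
y_aug = np.concatenate([y_seed, proba[confident].argmax(axis=1)])

# Train the full model on labelled plus pseudo-labelled data
# (the paper trains a full-fledged triphone system at this step).
full_model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000,
                           random_state=0).fit(X_aug, y_aug)
print(f"kept {confident.sum()}/{len(X_pool)} pseudo-labelled frames")
```

The confidence filter is the key design choice: only frames the seed model labels with high certainty enter the augmented training set, which keeps label noise from the weak seed model from overwhelming the larger second-stage model.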
