Multilingual hierarchical MRASTA features for ASR

A multilingual multilayer perceptron (MLP) training method was recently introduced that does not require explicitly mapping the phonetic units of multiple languages to a common set. This paper investigates the method further using bottleneck (BN) tandem connectionist acoustic modeling for four high-resourced languages: English, French, German, and Polish. To improve existing high-performing automatic speech recognition (ASR) systems, the multilingual training of the BN-MLP is extended from short-term to hierarchical long-term (multi-resolution RASTA, MRASTA) feature extraction. Deeper structures and context-dependent target labels are also examined. We demonstrate experimentally that a single state-of-the-art BN feature set can be trained for multiple languages, that it is superior to the monolingual feature set, and that it yields significant gains in all four languages. To study the scalability of the multilingual BN features, we run small-scale (50 hours) and larger-scale (300 hours) ASR experiments and observe a similar gain in both, regardless of how the data is distributed between the languages. Using deeper structures, context-dependent targets, and speaker adaptation, the multilingual BN features reduce the word error rate by 3-7% relative over the target-language BN features and by 25-30% over the conventional MFCC system.

Index Terms: deep MLP, bottleneck, multilingual, hierarchical, MRASTA, LVCSR
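One common way to train an MLP on several languages without a shared phone set is to share the hidden layers, including the bottleneck, and give each language its own softmax output layer. The sketch below illustrates that idea only; it is not the authors' implementation. The class name, layer sizes, per-language target counts, the linear bottleneck, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class MultilingualBottleneckMLP:
    """Hypothetical BN-MLP: shared layers plus one output layer per language."""

    def __init__(self, n_in, n_hidden, n_bn, targets_per_lang, seed=0):
        rng = random.Random(seed)
        # Layers shared by all languages.
        self.hidden_in = make_layer(n_in, n_hidden, rng)
        self.bottleneck = make_layer(n_hidden, n_bn, rng)
        self.hidden_out = make_layer(n_bn, n_hidden, rng)
        # Language-specific output layers, so no common phone set is needed.
        self.outputs = {
            lang: make_layer(n_hidden, n_targets, rng)
            for lang, n_targets in targets_per_lang.items()
        }

    def bottleneck_features(self, x):
        """Language-independent BN activations (assumed linear here)."""
        hidden = sigmoid(affine(self.hidden_in, x))
        return affine(self.bottleneck, hidden)

    def posteriors(self, x, lang):
        """Target posteriors for one language, used only during training."""
        bn = self.bottleneck_features(x)
        hidden = sigmoid(affine(self.hidden_out, bn))
        return softmax(affine(self.outputs[lang], hidden))


# Toy usage: the target counts per language are made up.
net = MultilingualBottleneckMLP(
    n_in=8, n_hidden=16, n_bn=4,
    targets_per_lang={"EN": 5, "FR": 6, "DE": 7, "PL": 4},
)
frame = [0.1] * 8
bn_features = net.bottleneck_features(frame)
post_en = net.posteriors(frame, "EN")
post_pl = net.posteriors(frame, "PL")
```

In such a setup, each training frame would update only the output layer of its own language, while gradients from all languages update the shared layers. At feature-extraction time the output layers are discarded and only `bottleneck_features` is used, which is why a single feature set can serve every language.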
