论文信息 - Investigation of multilingual deep neural networks for spoken term detection

Investigation of multilingual deep neural networks for spoken term detection

The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (~10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language independent acoustic model test on the target language showed that retraining or adapting of the acoustic models to the target language is currently minimally needed to achieve reasonable performance.

[1] Hynek Hermansky,et al. Cross-lingual and multi-stream posterior features for low resource LVCSR systems , 2010, INTERSPEECH.

[2] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[3] Andreas Stolcke,et al. SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[4] Daniel P. W. Ellis,et al. Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5] Jan Cernocký,et al. Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[6] S. J. Young,et al. Tree-based state tying for high accuracy acoustic modelling , 1994 .

[7] Mark J. F. Gales,et al. The efficient incorporation of MLP features into automatic speech recognition systems , 2011, Comput. Speech Lang..

[8] Pietro Laface,et al. On the use of a multilingual neural network front-end , 2008, INTERSPEECH.

[9] Haizhou Li,et al. Robust phone set mapping using decision tree clustering for cross-lingual phone recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Steve Renals,et al. Multilingual training of deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11] Dau-Cheng Lyu,et al. Experiments on Cross-Language Attribute Detection and Phone Recognition With Minimal Target-Specific Training Data , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[12] Tanja Schultz,et al. Globalphone: a multilingual speech and text database developed at karlsruhe university , 2002, INTERSPEECH.

[13] Tanja Schultz,et al. Fast bootstrapping of LVCSR systems with multilingual phoneme sets , 1997, EUROSPEECH.

[14] Geoffrey Zweig,et al. fMPE: discriminatively trained features for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[15] Martin Karafiát,et al. Study of probabilistic and Bottle-Neck features in multilingual environment , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16] Ralf Schlüter,et al. Investigation on cross- and multilingual MLP features under matched and mismatched acoustical conditions , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17] Dong Yu,et al. Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[18] Georg Heigold,et al. Multilingual acoustic models using distributed deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19] Steve Young,et al. The HTK book version 3.4 , 2006 .

[20] Martin Karafiát,et al. The language-independent bottleneck features , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[21] Andreas Stolcke,et al. Cross-Domain and Cross-Language Portability of Acoustic Features Estimated by Multilayer Perceptrons , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[22] Jonathan G. Fiscus,et al. Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[23] Yoshua. Bengio,et al. Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[24] Brian Kingsbury,et al. Exploiting diversity for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25] L. Finch. A hybrid approach , 1998 .

[26] Philip N. Garner,et al. Current trends in multilingual speech processing , 2011 .

[27] Yifan Gong,et al. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[28] S. Dupont,et al. Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[29] Xiaodong Cui,et al. A high-performance Cantonese keyword search system , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[30] Hervé Bourlard,et al. Towards mixed language speech recognition systems , 2010, INTERSPEECH.

[31] William J. Byrne,et al. Towards language independent acoustic modeling , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[32] Tara N. Sainath,et al. FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[33] Hynek Hermansky,et al. Multilingual MLP features for low-resource LVCSR systems , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34] Hervé Bourlard,et al. Connectionist Speech Recognition: A Hybrid Approach , 1993 .