Mandarin-English Bilingual Speech Recognition for Real-World Music Retrieval

This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual speech recognition system (MESRS) for real-world music retrieval. To balance the performance and the complexity of the bilingual speech recognition system, a single unified set of bilingual acoustic models derived by phone clustering is developed. A novel two-pass phone clustering method based on a confusion matrix (TCM) is presented and compared with the log-likelihood measure method. To deal with the Mandarin accent in spoken English, different non-native adaptation approaches are investigated. By combining phone clustering with non-native adaptation, MESRS reduced the phrase error rate (PhrER) on English utterances by 24.5% relative to the baseline monolingual English system, while its PhrER on Mandarin utterances remained comparable to that of the baseline monolingual Mandarin system. On bilingual code-mixing utterances, MESRS achieved a 22.4% relative PhrER reduction.
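The reductions quoted in the abstract are relative, not absolute. They express the drop in PhrER as a fraction of the baseline PhrER. The short Python sketch below shows that arithmetic under the usual definition. The function name `relative_reduction` and the input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), for illustration only.
example = relative_reduction(baseline_error=20.0, new_error=15.0)
print(f"{example:.1f}% relative reduction")  # prints "25.0% relative reduction"
```

With these illustrative numbers, an absolute drop of 5 points from a 20% baseline is a 25% relative reduction, so a relative figure always needs the baseline error rate to be interpreted.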
