Large-Scale Language Modeling with Random Forests for Mandarin Chinese Speech-to-Text

In this work, the random forest language modeling approach is applied with the aim of improving the performance of LIMSI's highly competitive Mandarin Chinese speech-to-text system. The experimental setup is that of the GALE Phase 4 evaluation, which is characterized by a large amount of available language model training data (over 3.2 billion segmented words). A conventional unpruned 4-gram language model with a 56K-word vocabulary serves as a baseline that is challenging to improve upon. Nevertheless, moderate perplexity and character error rate (CER) improvements over this baseline were obtained with a random forest language model. Different random forest training strategies were explored to attain the maximal performance gain, and a Forest of Random Forests language modeling scheme is introduced.
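The core idea behind random forest language modeling can be illustrated with a toy sketch: each "tree" conditions the next word on a randomly chosen subset of its n-gram history, and the forest probability is the average over trees. The code below is a minimal, hypothetical illustration only (tiny corpus, add-one smoothing instead of the decision-tree growing and held-out smoothing used in real systems); all function names and the corpus are invented for this example.

```python
import math
import random
from collections import defaultdict

def train_tree(corpus, order, rng):
    """One randomized 'tree' (toy version): condition each word on a
    random subset of the (order-1)-word history."""
    keep = sorted(rng.sample(range(order - 1), rng.randint(1, order - 1)))
    counts, totals = defaultdict(int), defaultdict(int)
    for sent in corpus:
        padded = ["<s>"] * (order - 1) + sent + ["</s>"]
        for i in range(order - 1, len(padded)):
            hist = tuple(padded[i - order + 1 + k] for k in keep)
            counts[(hist, padded[i])] += 1
            totals[hist] += 1
    return keep, counts, totals

def tree_prob(tree, hist_full, word, vocab_size):
    keep, counts, totals = tree
    hist = tuple(hist_full[k] for k in keep)
    # Add-one smoothing so unseen events keep nonzero mass
    return (counts[(hist, word)] + 1) / (totals[hist] + vocab_size)

def forest_perplexity(corpus, trees, order, vocab_size):
    """Perplexity under the forest: average tree probabilities per event."""
    logp, n = 0.0, 0
    for sent in corpus:
        padded = ["<s>"] * (order - 1) + sent + ["</s>"]
        for i in range(order - 1, len(padded)):
            hist_full = padded[i - order + 1:i]
            p = sum(tree_prob(t, hist_full, padded[i], vocab_size)
                    for t in trees) / len(trees)
            logp += math.log(p)
            n += 1
    return math.exp(-logp / n)

# Toy segmented-Mandarin corpus (illustrative only)
corpus = [["我", "爱", "你"], ["你", "爱", "我"]]
vocab = {w for s in corpus for w in s} | {"<s>", "</s>"}
rng = random.Random(0)
trees = [train_tree(corpus, 4, rng) for _ in range(8)]
ppl = forest_perplexity(corpus, trees, 4, len(vocab))
print(round(ppl, 2))
```

Averaging over randomized trees is what lets the forest smooth more aggressively than any single tree; the Forest of Random Forests scheme introduced in the paper extends this by combining multiple such forests.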
