Acoustic Model and Language Model Adaptation for a Mobile Dictation Service

Automatic speech recognition is the machine-based conversion of speech to text. MobiDic is a mobile dictation service that uses a server-side speech recognition system to convert speech recorded on a mobile phone into readable, editable text notes. In this work, the performance of the TKK speech recognition system was evaluated on law-related speech recorded on a mobile phone with the MobiDic client application. There was a mismatch between the test and training data in terms of both acoustics and language: the background acoustic models were trained on speech recorded with PC microphones, and the background language models were trained on texts from journals and newswire services. Because of the special nature of the test data, the main focus was on acoustic model and language model adaptation methods for improving recognition performance. Acoustic model adaptation gave the largest and most reliable gains. With the global cMLLR method, word error rate reductions of 15-22% were reached with only two minutes of adaptation data, and regression class cMLLR gave even larger gains when more audio adaptation data (over 10 minutes) was available. Language model adaptation did not significantly improve performance on this task, mainly because of differences between the language model adaptation data and the language of the law-related speech.
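To make the acoustic adaptation result concrete, the following is a minimal NumPy sketch of the feature-space transform that global cMLLR (also known as constrained MLLR or fMLLR) estimates. cMLLR adapts by applying one affine transform, x' = Ax + b, to every feature vector of a speaker or channel so that the transformed features better match the background acoustic models; the matrix and bias values below are illustrative stand-ins, not parameters from the thesis, where they would be estimated by maximizing the likelihood of roughly two minutes of adaptation speech under the HMM/GMM system.

```python
import numpy as np

def apply_cmllr(features: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a global cMLLR/fMLLR affine transform to a (T, D) feature matrix.

    Every frame x is mapped to A @ x + b. In regression class cMLLR,
    a separate (A, b) pair would be estimated per class of Gaussians,
    which is why it needs more adaptation data (> 10 min) to pay off.
    """
    return features @ A.T + b

# Illustrative stand-in values: D = 13 cepstral coefficients, T = 100 frames.
rng = np.random.default_rng(0)
T, D = 100, 13
feats = rng.normal(size=(T, D))
A = np.eye(D) + 0.01 * rng.normal(size=(D, D))  # near-identity transform
b = 0.1 * rng.normal(size=D)

adapted = apply_cmllr(feats, A, b)
print(adapted.shape)  # (100, 13)
```

Because the transform is applied to the features rather than the model means and variances, a single global cMLLR transform can be estimated robustly from very little data, which matches the abstract's finding that two minutes of adaptation speech already yields a 15-22% word error rate reduction.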
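On the language model side, the abstract does not name the exact adaptation method used, but the literature it draws on covers linear and log-linear interpolation as well as MDI estimation. As a hedged sketch of the simplest of these, linear interpolation mixes a small in-domain (here, law-related) model with the large background model, P(w|h) = lam * P_adapt(w|h) + (1 - lam) * P_bg(w|h); the probability values and the mixing weight below are hypothetical.

```python
import math

def interpolate(p_adapt: float, p_bg: float, lam: float = 0.3) -> float:
    """Linearly interpolate an in-domain and a background n-gram probability."""
    return lam * p_adapt + (1.0 - lam) * p_bg

# Hypothetical example: an in-domain legal term that the background
# model assigns low probability gets boosted by the adaptation model.
p = interpolate(p_adapt=0.02, p_bg=0.005, lam=0.3)
print(f"interpolated prob = {p:.4f}, log10 prob = {math.log10(p):.3f}")
```

A scheme like this only helps when the adaptation corpus actually resembles the target speech, which is consistent with the abstract's conclusion: the mismatch between the available language adaptation data and the law-related dictations limited what language model adaptation could achieve here.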
