Integration of multiple acoustic and language models for improved Hindi speech recognition system

Despite significant progress in automatic speech recognition (ASR) over the past three decades, it has not reached human-level performance, particularly in adverse conditions. To improve ASR performance, various approaches have been studied that differ in feature extraction method, classification method, and training algorithm. Different approaches often exploit complementary information, so combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid, and segmental HMMs by integrating them with the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each with its own feature set and classification technique. The complete system uses three separate acoustic models with three different feature sets, and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% using the proposed technique compared with conventional methods. The modules are implemented and tested for Hindi ASR in typical field conditions as well as in noisy environments.
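To make the two central notions in the abstract concrete, the following is a minimal Python sketch of how WER is commonly computed and of the voting idea behind ROVER (Recognizer Output Voting Error Reduction). It is not the system described in the paper. It assumes the three recognizer outputs have already been aligned slot by slot, whereas ROVER builds that alignment itself and can also weight votes by confidence scores. The function names, the `NULL` placeholder, and the English toy sentences are illustrative assumptions.

```python
from collections import Counter

NULL = "@"  # assumed placeholder meaning "no word in this aligned slot"


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def vote(aligned_hypotheses):
    """Majority vote per slot over pre-aligned, equal-length word lists.

    Simplification: the alignment is given, and every system has one equal
    vote. Ties go to the word seen first, i.e. the earliest listed system.
    """
    if len({len(h) for h in aligned_hypotheses}) != 1:
        raise ValueError("hypotheses must be pre-aligned to equal length")
    combined = []
    for slot in zip(*aligned_hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != NULL:  # a winning NULL means "emit nothing here"
            combined.append(word)
    return " ".join(combined)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    # Toy outputs from three hypothetical recognizers, one error each.
    systems = [
        ["the", "cat", "sat", "on", "a", "mat"],    # substitution
        ["the", "cat", NULL, "on", "the", "mat"],   # deletion
        ["the", "bat", "sat", "on", "the", "mat"],  # substitution
    ]
    for words in systems:
        text = " ".join(w for w in words if w != NULL)
        print(f"{text!r}: WER = {word_error_rate(reference, text):.3f}")
    combined = vote(systems)
    print(f"{combined!r}: WER = {word_error_rate(reference, combined):.3f}")
```

In this toy case each recognizer makes a different error, so the vote recovers the reference sentence. That illustrates why combining systems that carry complementary information can lower WER; when the systems make the same errors, voting cannot correct them.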
