A large vocabulary continuous speech recognition system for Persian language

This paper introduces the first large vocabulary speech recognition system for the Persian language. This continuous speech recognition system uses most of the standard, state-of-the-art speech and language modeling techniques. Development of the system, called Nevisa, began in 2003 with a predominantly academic focus. The engine incorporates customized versions of the established components of traditional continuous speech recognizers, and its parameters have been optimized for real applications of the Persian language. For this purpose, we had to identify the computational challenges of Persian, especially in text processing, and extract statistical and grammatical language models for it. To achieve this, we had to either generate the necessary speech and text corpora or modify the primitive corpora available for Persian.
In the proposed system, acoustic modeling is based on hidden Markov models, combined with optimized decoding, pruning and language modeling techniques. Both statistical and grammatical language models are incorporated in the system. An MFCC representation with some modifications is used as the speech signal feature. In addition, a voice activity detector (VAD) was designed and implemented based on signal energy and zero-crossing rate. Nevisa is equipped with out-of-vocabulary capability for applications with medium or small vocabulary sizes. Robustness techniques were also implemented and evaluated in the system: model-based approaches such as PMC, MLLR and MAP; feature robustness methods such as CMS, PCA, RCC and VTLN; and speech enhancement methods such as spectral subtraction and Wiener filtering, along with their modified versions. A new robustness method called PC-PMC was also proposed and incorporated in the system.
To evaluate the performance and optimize the parameters of the system in noisy-environment tasks, four real noisy speech data sets were generated. The final performance of Nevisa in noisy environments is similar to its performance in clean conditions, thanks to the various robustness methods implemented in the system. The overall recognition performance in clean and noisy conditions indicates that the system is a real-world product as well as a competitive ASR engine.
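
As an illustration of the energy and zero-crossing-rate VAD mentioned in the abstract, the following is a minimal sketch of a generic frame-level detector. It is not Nevisa's implementation: the function names, frame sizes, thresholds and the two-condition decision rule are all assumptions chosen for illustration, and a practical detector would normally estimate its thresholds from the background noise and smooth decisions over time.

```python
def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (incomplete tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def simple_vad(samples, frame_len=200, hop=100,
               high_energy=0.01, low_energy=0.001, zcr_thresh=0.25):
    """Return one boolean per frame: True where the frame is judged to be speech.

    Hypothetical decision rule (an assumption, not taken from the paper):
      - a frame with energy above `high_energy` is speech (typically voiced);
      - a frame with moderate energy (above `low_energy`) and a high
        zero-crossing rate is also speech (typically unvoiced, e.g. fricatives).
    """
    decisions = []
    for frame in frame_signal(samples, frame_len, hop):
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        is_speech = energy > high_energy or (energy > low_energy and zcr > zcr_thresh)
        decisions.append(is_speech)
    return decisions
```

Calling `simple_vad(samples)` on a list of floating-point samples scaled to roughly [-1, 1] returns one decision per frame; with the default arguments and an 8 kHz sampling rate (also an assumption), each frame spans 25 ms with a 12.5 ms hop.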

[1]  Mei-Yuh Hwang,et al.  Senones, multi-pass search, and unified stochastic modeling in sphinx-II , 1993, EUROSPEECH.

[2]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3]  Hermann Ney,et al.  Algorithms for bigram and trigram word clustering , 1995, Speech Commun..

[4]  Hermann Ney,et al.  Improved lexical tree search for large vocabulary speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[5]  Alejandro Acero,et al.  Acoustical and environmental robustness in automatic speech recognition , 1991 .

[6]  S. M. Ahadi,et al.  Recognition of continuous persian speech using a medium-sized vocabulary speech corpus , 1999, EUROSPEECH.

[7]  Hadi Veisi,et al.  The integration of principal component analysis and cepstral mean subtraction in parallel model combination for robust speech recognition , 2011, Digit. Signal Process..

[8]  Salwani Abdullah,et al.  Great Deluge Algorithm for Rough Set Attribute Reduction , 2010, FGIT-DTA/BSBT.

[9]  Philip C. Woodland,et al.  Experiments in speaker normalisation and adaptation for large vocabulary speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Mahmood Bijankhan,et al.  Lessons from building a Persian written corpus: Peykare , 2011, Lang. Resour. Evaluation.

[11]  Mark J. F. Gales,et al.  Mean and variance adaptation within the MLLR framework , 1996, Comput. Speech Lang..

[12]  Pedro J. Moreno,et al.  Speech recognition in noisy environments , 1996 .

[13]  Mansour Sheikhan,et al.  Continuous speech recognition and syntactic processing in Iranian Farsi language , 1997, Int. J. Speech Technol..

[14]  Andrew Radford,et al.  Transformational Grammar: A First Course , 1988 .

[15]  Hermann Ney,et al.  Speaker adaptive modeling by vocal tract normalization , 2002, IEEE Trans. Speech Audio Process..

[16]  Steve J. Young,et al.  The use of state tying in continuous speech recognition , 1993, EUROSPEECH.

[17]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[18]  Philip C. Woodland,et al.  Speaker adaptation: techniques and challenges , 1999 .

[19]  Hossein Sameti,et al.  A computational grammar for Persian based on GPSG , 2011, Lang. Resour. Evaluation.

[20]  Ronald Rosenfeld,et al.  Statistical language modeling using the CMU-cambridge toolkit , 1997, EUROSPEECH.

[21]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[22]  H. Ney,et al.  Improvements in beam search for 10000-word continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[23]  Hadi Veisi,et al.  Nevisa, a Persian Continuous Speech Recognition System , 2008 .

[24]  Hadi Veisi,et al.  The Combination of CMS with PMC for Improving Robustness of Speech Recognition Systems , 2008 .

[25]  Sadaoki Furui,et al.  50 Years of Progress in Speech and Speaker Recognition Research , 1970 .

[26]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[27]  Aravind K. Joshi,et al.  Tree Adjunct Grammars , 1975, J. Comput. Syst. Sci..

[28]  Andrew Radford,et al.  Transformational Grammar: Contents , 1988 .

[29]  MOHAMMAD BAHRANI,et al.  Building Statistical Language Models for Persian Continuous Speech Recognition Systems Using the Peykare Corpus , 2011, Int. J. Comput. Process. Orient. Lang..

[30]  John T. Maxwell,et al.  Formal issues in lexical-functional grammar , 1998 .

[31]  James F. Allen Natural language understanding , 1987, Bejnamin/Cummings series in computer science.

[32]  C. E. SHANNON,et al.  A mathematical theory of communication , 1948, MOCO.

[33]  M Bijankhan,et al.  FARSDAT- THE SPEECH DATABASE OF FARSI SPOKEN LANGUAGE , 1994 .

[34]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[35]  Alexander H. Waibel,et al.  Speaker normalization and speaker adaptation - a combination for conversational speech recognition , 1997, EUROSPEECH.

[36]  H. Veisi,et al.  Improving the Robustness of Persian Large Vocabulary Continuous Speech Recognition System for Real Applications , 2006, 2006 2nd International Conference on Information & Communication Technologies.

[37]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[38]  Mark J. F. Gales,et al.  Model-based techniques for noise robust speech recognition , 1995 .

[39]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[40]  Simon King,et al.  IEEE Workshop on automatic speech recognition and understanding , 2009 .

[41]  Ronald M. Kaplan,et al.  The Formal Architecture of Lexical-Functional Grammar , 1989, J. Inf. Sci. Eng..

[42]  Shrikanth S. Narayanan,et al.  Language-adaptive persian speech recognition , 2003, INTERSPEECH.

[43]  Alex Acero,et al.  Spoken Language Processing , 2001 .

[44]  Geoffrey K. Pullum,et al.  Generalized Phrase Structure Grammar , 1985 .

[45]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .