Robust Arabic Multi-stream Speech Recognition System in Noisy Environment

In this paper, the framework of multi-stream combination is explored to improve the noise robustness of automatic speech recognition (ASR) systems. The two central issues in multi-stream systems are which feature representations to combine and what importance (weights) to assign to each. Two feature streams are investigated: the MFCC features and a set of complementary features consisting of pitch frequency, energy, and the first three formants. Optimal weights for each stream are fixed empirically. The multi-stream vectors are modeled by Hidden Markov Models (HMMs) with Gaussian Mixture Model (GMM) state distributions. Our ASR system is implemented using the HTK toolkit and the ARADIGIT corpus, a database of spoken Arabic words. The obtained results show that, for highly noisy speech, the proposed multi-stream vectors lead to a significant improvement in recognition accuracy.
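The stream-weighting idea described above is commonly realized by raising each stream's state likelihood to a fixed exponent, i.e. taking a weighted sum of the per-stream log-likelihoods. A minimal sketch of that combination rule follows; the function name, the example weights, and the two-stream setup are illustrative assumptions, not values from the paper:

```python
import numpy as np

def combine_stream_loglik(loglik_streams, weights):
    """Exponent-weighted combination of per-stream state log-likelihoods.

    In the probability domain this is a product of powers,
    p(x | state) = prod_s p_s(x_s | state) ** w_s,
    which becomes a weighted sum in the log domain.

    loglik_streams : per-stream log-likelihoods for one HMM state,
                     e.g. [logp_mfcc, logp_complementary]
    weights        : one non-negative weight per stream
                     (fixed empirically in the paper's setup)
    """
    loglik_streams = np.asarray(loglik_streams, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum of log-likelihoods == log of the weighted product.
    return float(np.dot(weights, loglik_streams))

# Hypothetical example: MFCC stream weighted 0.7, complementary stream 0.3.
combined = combine_stream_loglik([-10.0, -20.0], [0.7, 0.3])
```

In decoding, this combined score simply replaces the single-stream state log-likelihood inside the usual Viterbi recursion, which is why the weights can be tuned empirically without changing the recognizer's search.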
