Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition

Robustness to noisy background conditions is an important challenge in building speech-recognition-based applications. This paper proposes a bimodal approach to improve the robustness of a Hindi speech recognition system. It also studies the importance of different types of visual features for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the Two-Dimensional Discrete Cosine Transform (2D-DCT), Principal Component Analysis (PCA), the Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT), and the Two-Dimensional Discrete Wavelet Transform followed by PCA (2D-DWT-PCA). The audio features are extracted using Mel-Frequency Cepstral Coefficients (MFCC), with static and dynamic features. In total, 48 features (39 audio and 9 visual) are used to measure the performance of the AVASR system. Performance is also evaluated on noisy speech generated with the NOISEX database at signal-to-noise ratios (SNR) from 30 dB to -10 dB, using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality audio-visual database of continuous Hindi speech, consisting of Hindi sentences spoken by different subjects.
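The abstract does not say how the noisy test signals were produced, so the following is only a minimal sketch of one common way to mix a noise recording into clean speech at a target SNR. The function names `mix_at_snr` and `measured_snr_db`, the 16 kHz sampling rate, the synthetic sine "speech", the random "noise", and the intermediate SNR steps are all assumptions made for illustration; only the 30 dB to -10 dB range comes from the text.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Average power of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def measured_snr_db(clean, noisy):
    """SNR of `noisy` relative to `clean`, treating the difference as noise."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0           # 1 s at an assumed 16 kHz rate
    speech = np.sin(2.0 * np.pi * 220.0 * t)  # stand-in for a clean utterance
    noise = rng.standard_normal(4000)         # stand-in for a noise recording

    # Illustrative steps across the 30 dB to -10 dB range named in the abstract.
    for snr in (30, 20, 10, 0, -10):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 2))
```

Because the gain is computed from the power of the noise segment that is actually added, the SNR measured on the mixture matches the requested value up to floating-point error. With real recordings, the noise segment would typically be drawn from a NOISEX file rather than generated randomly.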
