Audio-Visual Speech Processing for Human Computer Interaction

This chapter presents an audio-visual speech recognition (AVSR) system for Human Computer Interaction (HCI) that focuses on three modules: (i) radial basis function neural network (RBF-NN) voice activity detection (VAD), (ii) watershed lips detection with H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The importance of AVSR as the pipeline for HCI and the background studies of the respective modules are discussed first, followed by the design details of the overall proposed AVSR system. Conventional lips detection requires skin/non-skin detection and face localization as prerequisites. The proposed watershed lips detection, aided by H∞ lips tracking, detects the lips directly and so makes those preliminary steps unnecessary, which potentially saves time. In the audio modality, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy measures, which yields higher recognition performance. Lastly, the developed AVSR system, which integrates the audio and visual information as well as the temporal synchrony of the audio-visual data streams, obtains a significant improvement over unimodal speech recognition and over the decision and feature integration approaches.
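For orientation, the conventional baseline that the RBF-NN VAD is compared against can be sketched as follows. This is a minimal illustration of a zero-crossing rate and short-term energy detector, not the chapter's implementation: the function names, the frame length and hop (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16 kHz sampling rate), and both thresholds are assumptions chosen for the example.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (rows of the result)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_term_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame that change sign."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def baseline_vad(x, frame_len=400, hop=160, energy_ratio=0.1, zcr_max=0.3):
    """Return one boolean per frame: True where the frame looks like speech.

    Thresholds are illustrative assumptions: a frame is kept when its energy
    exceeds a fraction of the loudest frame and its ZCR stays below zcr_max.
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = short_term_energy(frames)
    zcr = zero_crossing_rate(frames)
    return (energy > energy_ratio * energy.max()) & (zcr < zcr_max)
```

Fixed thresholds of this kind are sensitive to the noise level, which is the weakness that a learned detector such as the RBF-NN VAD is intended to address.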
