Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

This book presents a summary of the cognitively inspired basis of multimodal speech enhancement, covering the relationship between the audio and visual modalities in speech as well as recent research into audiovisual speech correlation. It also discusses a number of audiovisual speech filtering approaches that exploit this relationship. A novel multimodal speech enhancement system that uses both visual and audio information to filter speech is presented, and the book explores extending this system with fuzzy logic to demonstrate an initial implementation of an autonomous, adaptive, and context-aware multimodal system. Finally, the book discusses the challenges of testing such a system, the limitations of many current audiovisual speech corpora, and a suitable approach to developing a corpus designed to test this novel, cognitively inspired speech filtering system.
