Multimodal feature extraction and fusion for audio-visual speech recognition

Multimodal signal processing analyzes a physical phenomenon through several types of measurements, or modalities, extracting higher-quality and more reliable information than can be obtained from any single-modality signal. The advantage is two-fold. First, since the modalities are usually complementary, the end result of multimodal processing is more informative than the result obtained from each modality individually. This holds across application domains such as human-machine interaction, multimodal identification and multimodal image processing. Second, since modalities are not always reliable, when one modality becomes corrupted the missing information can be recovered from the other.

Multimodal signal processing poses two essential challenges. The first is that the features used from each modality need to be as relevant and as few as possible. Because multimodal systems have to process more than one modality, they run into errors caused by the curse of dimensionality much more easily than mono-modal ones. The curse of dimensionality refers to the fact that the number of uniformly distributed samples required to cover a region of space grows exponentially with the dimensionality of that space. This has important implications for classification, since accurate models can only be obtained from an adequate number of samples, and the required number of samples grows with the dimensionality of the features. Dimensionality reduction is thus a necessary step in any application dealing with complex signals, and it is achieved through feature selection, feature transforms, or a combination of the two.

The second essential challenge is multimodal integration. Since the signals involved do not necessarily have the same data rate, range or even dimensionality, combining information coming from such different sources is not straightforward. Integration can be performed at different levels, from the basic signal level, where the signals themselves are combined if they are compatible, up to the highest decision level, where only the individual decisions taken on each signal are combined. Ideally, the fusion method should allow temporal variations in the relative importance of the two streams, to account for possible changes in their quality; however, this is only possible with methods operating at a high decision level.

The aim of this thesis is to offer solutions to both of these challenges, in the context of audio-visual speech recognition and speaker localization, two applications from the field of human-machine interaction. Audio-visual speech recognition aims to improve the accuracy of speech recognizers by augmenting the audio with information extracted from the video, in particular the movement of the speaker's lips. This works especially well when the audio is corrupted, in which case it leads to significant gains in accuracy. Speaker localization means detecting the active speaker in an audio-visual sequence containing several persons, which is useful for videoconferencing and the automated annotation of meetings.

These two applications are the context in which we present our solutions to both feature selection and multimodal integration. First, we show how informative features can be extracted from the visual modality, using an information-theoretic framework that gives a quantitative measure of the relevance of individual features.
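As an illustration of this kind of information-theoretic relevance measure, the sketch below scores a candidate feature by its mutual information with the class labels, estimated through simple histogram discretization. The estimator, the number of bins and the toy data are illustrative assumptions on our part, not the exact pipeline used in the thesis.

```python
import numpy as np

def mutual_information(feature, labels, n_bins=16):
    """Estimate I(X; C), in nats, between a continuous feature X and
    discrete class labels C, by discretizing X into histogram bins."""
    classes = np.unique(labels)
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    # Internal edges only, so bin indices fall in 0 .. n_bins - 1.
    bin_idx = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    # Joint histogram: rows are feature bins, columns are classes.
    joint = np.zeros((n_bins, classes.size))
    for j, c in enumerate(classes):
        joint[:, j] = np.bincount(bin_idx[labels == c], minlength=n_bins)
    p_xc = joint / joint.sum()
    p_x = p_xc.sum(axis=1, keepdims=True)
    p_c = p_xc.sum(axis=0, keepdims=True)
    mask = p_xc > 0
    return float((p_xc[mask] * np.log(p_xc[mask] / (p_x @ p_c)[mask])).sum())

# Toy example: rank two candidate features by their relevance to the class.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
informative = labels + 0.5 * rng.standard_normal(1000)  # correlated with class
noise = rng.standard_normal(1000)                       # irrelevant feature
scores = {"informative": mutual_information(informative, labels),
          "noise": mutual_information(noise, labels)}
print(sorted(scores.items(), key=lambda kv: -kv[1]))    # informative ranks first
```

Ranking features by such a score gives a quantitative, classifier-independent notion of relevance, which is the starting point for the selection methods discussed next.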
Beyond relevance, we also show that reducing the redundancy between the selected features is important for avoiding the curse of dimensionality and for improving recognition results. The methods that we present are novel in the field of audio-visual speech recognition, and we found that their use leads to significant improvements over the state of the art.

Second, we present a method of multimodal fusion at the level of intermediate decisions, using a weight for each of the streams. The weights are adaptive, changing according to the estimated reliability of each stream; this makes the system tolerant to changes in the quality of either stream, and even to the temporary interruption of one of them. The reliability estimate is based on the entropy of the posterior probability distributions of each stream at the intermediate decision level (a simplified sketch of this weighting is given below). Our results are superior to those obtained with a state-of-the-art method based on maximizing the same posteriors. Moreover, we analyze the effect of a constraint typically imposed on stream weights in the literature, namely that they sum to one, and show that removing this constraint can lead to improvements in recognition accuracy.

Finally, we develop a method for audio-visual speaker localization based on the correlation between the audio energy and the movement of the speaker's lips. It relies on a joint probability model of the audio and video, from which we build a likelihood map showing the probable positions of the speaker's mouth (see the second sketch below). We show that this novel method performs better than a similar method from the literature.

In conclusion, we analyze two different challenges of multimodal signal processing for two audio-visual problems, and offer innovative approaches for solving them.
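To make the entropy-based weighting concrete, the first sketch below performs weighted product fusion of the per-frame posteriors of two streams, where a stream's weight shrinks as the entropy of its posterior grows. The inverse-entropy mapping and the toy posteriors are our own illustrative assumptions; the thesis's reliability estimator may differ in its details.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a posterior distribution over classes."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def fuse_posteriors(p_audio, p_video):
    """Weighted product fusion of the per-frame posteriors of two streams.

    Each stream's weight decreases as the entropy of its posterior grows,
    so an uncertain (e.g. noise-corrupted) stream contributes less. The two
    weights are deliberately not constrained to sum to one.
    """
    w_audio = 1.0 / (1.0 + entropy(p_audio))  # inverse-entropy reliability
    w_video = 1.0 / (1.0 + entropy(p_video))
    log_p = w_audio * np.log(p_audio) + w_video * np.log(p_video)
    p = np.exp(log_p - log_p.max())           # renormalize for stability
    return p / p.sum()

# Toy frame: near-uniform (unreliable) audio versus a confident video stream.
p_audio = np.array([0.30, 0.25, 0.25, 0.20])  # corrupted audio: high entropy
p_video = np.array([0.85, 0.05, 0.05, 0.05])  # clean video: low entropy
print(fuse_posteriors(p_audio, p_video))      # dominated by the video stream
```

In a multi-stream recognizer, such fused frame posteriors would feed the decoder, with the weights re-estimated frame by frame as the quality of the streams changes over time.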
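The speaker localization idea can likewise be sketched in simplified form. The thesis builds a joint probability model of audio and video and derives a likelihood map from it; the second sketch below conveys only the underlying intuition, correlating the audio energy envelope with local visual motion so that the map's maximum marks the likely mouth position. The array shapes and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def localization_map(audio_energy, motion):
    """Per-region correlation between audio energy and local visual motion.

    audio_energy : (T,)       audio energy per frame
    motion       : (T, H, W)  motion magnitude per frame and spatial region
    Returns an (H, W) map; high values mark regions whose motion co-varies
    with the audio, i.e. the likely position of the active speaker's mouth.
    """
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-12)
    m = (motion - motion.mean(axis=0)) / (motion.std(axis=0) + 1e-12)
    # Normalized cross-correlation at zero lag, per spatial location.
    return np.einsum('t,thw->hw', a, m) / len(a)

# Toy sequence: one region's motion follows the audio, the rest is noise.
rng = np.random.default_rng(1)
T, H, W = 200, 8, 8
audio = np.abs(rng.standard_normal(T))                 # synthetic energy envelope
motion = rng.standard_normal((T, H, W))
motion[:, 5, 2] += 2.0 * (audio - audio.mean())        # "mouth" placed at (5, 2)
amap = localization_map(audio, motion)
print(np.unravel_index(amap.argmax(), amap.shape))     # -> (5, 2)
```

The probabilistic model in the thesis replaces this raw correlation with a learned joint density of the two signals, but the resulting likelihood map is read in the same way: its peak indicates the active speaker's mouth.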
