An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.

[1]  Alfred DeMaris,et al.  A Tutorial in Logistic Regression , 1995 .

[2]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[3]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[4]  Gwen Littlewort,et al.  Faces of pain: automated measurement of spontaneousallfacial expressions of genuine and posed pain , 2007, ICMI '07.

[5]  Justus H. Piater,et al.  Online Learning of Gaussian Mixture Models - a Two-Level Approach , 2008, VISAPP.

[6]  David W. Hosmer,et al.  Applied Logistic Regression , 1991 .

[7]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[8]  J.N. Gowdy,et al.  CUAVE: A new audio-visual database for multimodal human-computer interface research , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Javier R. Movellan,et al.  Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.

[10]  Marian Stewart Bartlett,et al.  Measuring the Perceived Difficulty of a Lecture Using Automatic Facial Expression Recognition , 2008, Intelligent Tutoring Systems.

[11]  Douglas A. Reynolds,et al.  Experimental evaluation of features for robust speaker identification , 1994, IEEE Trans. Speech Audio Process..

[12]  Gwen Littlewort,et al.  Machine learning methods for fully automatic recognition of facial expressions and facial actions , 2004, 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583).

[13]  Michael Elad,et al.  Pixels that sound , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  Gwen Littlewort,et al.  Automatic Recognition of Facial Actions in Spontaneous Expressions , 2006, J. Multim..

[15]  Malcolm Slaney,et al.  FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks , 2000, NIPS.

[16]  Trevor Darrell,et al.  Speaker association with signal-level audiovisual fusion , 2004, IEEE Transactions on Multimedia.

[17]  James M. Rehg,et al.  Vision-based speaker detection using Bayesian networks , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[18]  Ben J. A. Kröse,et al.  On-line multi-modal speaker diarization , 2007, ICMI '07.

[19]  Marian Stewart Bartlett,et al.  Automatic facial expression recognition for intelligent tutoring systems , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[20]  Paul Mineiro,et al.  Robust Sensor Fusion: Analysis and Application to Audio Visual Speech Recognition , 1998, Machine Learning.

[21]  W. Marsden I and J , 2012 .

[22]  Gwen Littlewort,et al.  Drowsy Driver Detection Through Facial Movement Analysis , 2007, ICCV-HCI.

[23]  Beth Logan,et al.  Mel Frequency Cepstral Coefficients for Music Modeling , 2000, ISMIR.