MEASURING AUDIO AND VISUAL SPEECH SYNCHRONY: METHODS AND APPLICATIONS

Speech is an intrinsically bimodal means of communication: the audio signal originates from the dynamics of the articulators, part of which are visible on the speaker's face. This paper reviews recent work in the field of audiovisual speech, and more specifically the techniques developed to measure the level of correspondence between audio and visual speech. It surveys the most common front-end processing methods for audio and visual speech, the transformations applied to audio, visual, or joint audiovisual feature spaces, and the measures used to quantify the correspondence between audio and visual speech. Finally, applications of synchrony measurement are described.
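As a rough illustration of what a measure of correspondence can look like, the sketch below scores two per-frame feature sequences by their Pearson correlation over a range of temporal lags and reports the lag with the highest score. This is only one simple choice of measure, not necessarily one the paper favors; the function names `pearson` and `best_lag`, and the assumption that each modality has already been reduced to a single scalar per frame (for example audio energy and mouth opening), are hypothetical and introduced here for illustration.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation of two equal-length 1-D arrays (0.0 if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def best_lag(audio_feat, visual_feat, max_lag):
    """Return (lag, correlation) for the lag in [-max_lag, max_lag] with the highest correlation.

    A positive lag means the visual sequence is delayed by `lag` frames
    relative to the audio sequence. Both inputs are assumed (hypothetically)
    to be scalar features sampled at the same frame rate.
    """
    audio_feat = np.asarray(audio_feat, dtype=float)
    visual_feat = np.asarray(visual_feat, dtype=float)
    n = min(len(audio_feat), len(visual_feat))
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_feat[: n - lag], visual_feat[lag:n]
        else:
            a, v = audio_feat[-lag:n], visual_feat[: n + lag]
        r = pearson(a, v)
        if r > best[1]:
            best = (lag, r)
    return best


# Synthetic usage example: the "visual" sequence is the audio sequence delayed by 3 frames.
rng = np.random.default_rng(0)
audio = rng.standard_normal(200)
visual = np.roll(audio, 3)
lag, r = best_lag(audio, visual, max_lag=5)
print(lag, round(r, 3))
```

On real data the peak correlation would be well below 1, and a decision about synchrony would require a threshold chosen on held-out recordings.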
