Talking-face Verification

This chapter addresses identity verification based on talking faces, a relatively new area. This biometric modality is intrinsically multimodal: it contains both the voice and face modalities, and it also integrates the combined dynamics of voice and lip motion. First, an overview of the state of the art in the field of talking faces is given. The benchmarking evaluation framework for the talking-face modality is then introduced. This framework, composed of reference systems, the well-known BANCA database, and its associated Pooled protocol P, aims to ensure a fair comparison of talking-face verification algorithms. Finally, research prototypes whose main innovation is the use of a globally defined audiovisual synchrony are evaluated within the benchmarking framework.
