Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models

This paper addresses liveness detection, a test that ensures biometric cues are acquired from a live person who is actually present at the time of capture. The liveness check measures the degree of synchrony between the lips and the voice extracted from a video sequence. Three new asynchrony detection methods based on co-inertia analysis (CoIA) and a fourth based on coupled hidden Markov models (CHMMs) are derived. They are compared experimentally with several methods previously used in the literature for asynchrony detection and speaker location. The reported results show that the proposed CoIA-based and CHMM-based methods are effective asynchrony detectors and outperform the earlier methods.
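The idea of scoring synchrony with CoIA can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes per-frame audio and visual feature vectors are already available, keeps only the first pair of co-inertia axes (the leading singular vectors of the audio-visual cross-covariance matrix, found here by power iteration), fits them on the same sequence it scores, and uses the absolute correlation of the two projections as the score. All function names and the synthetic data are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from a list of feature vectors."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def cross_covariance(x, y):
    """Return C = X^T Y / n for centered X (n x p) and Y (n x q)."""
    n, p, q = len(x), len(x[0]), len(y[0])
    return [[sum(x[t][i] * y[t][j] for t in range(n)) / n for j in range(q)]
            for i in range(p)]

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def first_coinertia_axes(c, iters=200):
    """Leading left/right singular vectors of C via power iteration.

    These unit vectors maximize the covariance between the projected audio
    and projected visual features. The all-ones start is an assumption that
    works for the demo; a degenerate C would need a different start.
    """
    p, q = len(c), len(c[0])
    b = normalize([1.0] * q)
    for _ in range(iters):
        a = normalize([sum(c[i][j] * b[j] for j in range(q)) for i in range(p)])
        b = normalize([sum(c[i][j] * a[i] for i in range(p)) for j in range(q)])
    return a, b

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((s - mu) * (t - mv) for s, t in zip(u, v))
    su = math.sqrt(sum((s - mu) ** 2 for s in u))
    sv = math.sqrt(sum((t - mv) ** 2 for t in v))
    return cov / (su * sv)

def synchrony_score(audio, video):
    """Absolute correlation of audio and video projected on the first axes."""
    x, y = center(audio), center(video)
    a, b = first_coinertia_axes(cross_covariance(x, y))
    proj_audio = [sum(ai * xi for ai, xi in zip(a, row)) for row in x]
    proj_video = [sum(bi * yi for bi, yi in zip(b, row)) for row in y]
    return abs(pearson(proj_audio, proj_video))

# Synthetic demo (hypothetical data): one latent "speech" signal drives both
# a 3-dimensional audio stream and a 2-dimensional visual stream.
rng = random.Random(0)
latent = [rng.gauss(0, 1) for _ in range(200)]
audio = [[s + 0.1 * rng.gauss(0, 1), 0.5 * s + 0.1 * rng.gauss(0, 1),
          rng.gauss(0, 1)] for s in latent]
video = [[0.8 * s + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)] for s in latent]

# Simulate asynchrony by circularly shifting the visual stream by 10 frames.
video_shifted = video[10:] + video[:10]

synced = synchrony_score(audio, video)
shifted = synchrony_score(audio, video_shifted)
print("synchronized score:", synced)
print("shifted score:", shifted)
```

In this toy setting the synchronized pair scores higher than the shifted pair, and a liveness decision would compare the score with a threshold. Fitting the axes on the scored sequence inflates scores for short sequences, so a practical system would need to choose carefully what data the axes are estimated from.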
