Audiovisual synchrony assessment for replay attack detection in talking face biometrics

Audiovisual speech synchrony detection is an important liveness check for talking face verification systems, as it ensures that the audio and video biometric samples were actually acquired from the same source. In prior work, the visual speech features used have mainly described facial appearance or mouth shape in a frame-wise manner, thus ignoring the lip motion between consecutive frames. Since visual speech dynamics are also important, we take spatiotemporal information into account and propose the use of space-time auto-correlation of gradients (STACOG) for measuring audiovisual synchrony. To evaluate the effectiveness of the proposed approach, a set of challenging and realistic attack scenarios is designed by augmenting the publicly available BANCA and XM2VTS datasets with synthetic replay attacks. Our experimental analysis shows that the STACOG features outperform state-of-the-art features, such as those based on the discrete cosine transform, in measuring audiovisual synchrony.
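The general idea of scoring synchrony between an audio track and a visual track can be illustrated with a minimal sketch. It is not the STACOG pipeline described above: it assumes each modality has already been reduced to one hypothetical scalar feature per frame, uses synthetic data, and substitutes a plain Pearson correlation for whatever synchrony measure the full method employs. The names `synchrony_score`, `mouth_motion`, `genuine_audio` and `replay_audio` are illustrative only.

```python
import numpy as np


def synchrony_score(audio_feat, visual_feat):
    """Absolute Pearson correlation between two equal-length 1-D feature tracks.

    A toy stand-in for an audiovisual synchrony measure (assumption, not the
    method of the paper): higher values suggest the two tracks co-vary.
    """
    a = np.asarray(audio_feat, dtype=float)
    v = np.asarray(visual_feat, dtype=float)
    if a.ndim != 1 or a.shape != v.shape:
        raise ValueError("expected two 1-D tracks of equal length")
    # Centre both tracks so the normalised dot product is a correlation.
    a = a - a.mean()
    v = v - v.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # a constant track carries no synchrony information
    return float(abs(np.dot(a, v) / denom))


# Synthetic data (assumption): one scalar feature per video frame.
rng = np.random.default_rng(0)
n_frames = 200
mouth_motion = rng.standard_normal(n_frames)

# Genuine sample: the audio feature follows the lip motion, plus noise.
genuine_audio = mouth_motion + 0.3 * rng.standard_normal(n_frames)
# Replay-style attack: audio taken from an unrelated recording.
replay_audio = rng.standard_normal(n_frames)

genuine = synchrony_score(genuine_audio, mouth_motion)
attack = synchrony_score(replay_audio, mouth_motion)
print(f"genuine: {genuine:.3f}  attack: {attack:.3f}")
```

In a liveness check along these lines, a sample would be rejected when its score falls below a threshold tuned on development data. Richer spatiotemporal descriptors would replace the scalar visual track used here.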
