Audiovisual Synchrony Detection with Optimized Audio Features

Audiovisual speech synchrony detection is an important part of talking-face verification systems. Prior work has primarily focused on visual features and joint-space models, while standard mel-frequency cepstral coefficients (MFCCs) have commonly been used to represent speech. We focus more closely on the audio side by studying the impact of context window length in delta feature computation and by comparing MFCCs with simpler energy-based features for lip-sync detection. For the visual side we adopt a state-of-the-art hand-crafted lip-motion descriptor, space-time auto-correlation of gradients (STACOG), and use canonical correlation analysis (CCA) for joint-space modeling. To enhance the joint-space model, we further adopt deep CCA (DCCA), a nonlinear extension of CCA. Our results on the XM2VTS data show substantially improved audiovisual speech synchrony detection, with an equal error rate (EER) of 3.68%. Further analysis reveals that failed lip-region localization and beardedness of the subjects account for most of the errors. Thus, the lip-motion description is the bottleneck, and novel audio features or joint-modeling techniques are unlikely to boost lip-sync detection accuracy further.
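The two ingredients the abstract highlights, delta (transitional) features computed over a tunable context window and CCA-based joint-space modeling, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the standard regression formula for deltas is assumed, the window length `N` is the parameter whose effect the study varies, and canonical correlations are computed by whitening plus SVD with a small regularizer.

```python
import numpy as np

def delta(feats, N=2):
    """Delta (transitional) features over a +/-N frame context window.

    feats: (T, D) array of per-frame audio features, e.g. MFCCs.
    N is the context window length studied in the paper (assumed
    standard regression formulation; edge frames are replicated).
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

def canonical_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two views (e.g. audio and visual
    feature streams), via covariance whitening and SVD."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    T = X.shape[0]
    Sxx = X.T @ X / T + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / T + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / T

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # Singular values of M are the canonical correlations, descending.
    return np.linalg.svd(M, compute_uv=False)
```

In a synchrony-detection setting, a score could then be formed from the leading canonical correlations between the audio stream (e.g. MFCCs plus deltas) and the visual stream (e.g. STACOG) over a test clip, with low scores flagging desynchronized (potentially spoofed) audio-video pairs.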
