Tampered Speaker Inconsistency Detection with Phonetically Aware Audio-visual Features

The recent increase in social media-based propaganda, i.e., ‘fake news’, calls for automated methods to detect tampered content. In this paper, we focus on detecting tampering in videos of a person speaking to a camera. This form of manipulation is easy to perform, since one can simply replace a part of the audio, dramatically changing the meaning of the video. We consider several detection approaches based on phonetic features and recurrent networks. We demonstrate that replacing standard MFCC features with embeddings from a DNN trained for automatic speech recognition, combined with mouth landmarks (visual features), yields a significant performance improvement on several challenging publicly available speaker databases (VidTIMIT, AMI, and GRID), for which we generated sets of tampered data. The evaluations show a relative equal error rate reduction of 55% (from 10.0% to 4.5%) on the large GRID-based dataset and satisfactory generalization of the model to the other datasets.
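To make the pipeline concrete, the following is a minimal sketch (not the paper's implementation) of how per-frame audio features and mouth-landmark features might be fused and scored by a recurrent classifier. It assumes PyTorch; the feature dimensions, network sizes, and temporal pooling are illustrative assumptions rather than the configuration evaluated here. In such a setup, the audio stream would carry the ASR-DNN embeddings (or MFCCs for a baseline) and the visual stream the flattened mouth-landmark coordinates, aligned to a common frame rate.

```python
# Minimal sketch (illustrative, not the paper's implementation): a recurrent
# classifier over fused audio-visual frame sequences that scores whether a
# clip is tampered. All dimensions below are assumptions for demonstration.
import torch
import torch.nn as nn


class TamperDetector(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=40, hidden_dim=64):
        super().__init__()
        # Bidirectional LSTM over concatenated audio + visual frame features.
        self.rnn = nn.LSTM(audio_dim + visual_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        # Single logit per clip: tampered vs. genuine.
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, frames, audio_dim), e.g. ASR-DNN embeddings or MFCCs
        # visual_feats: (batch, frames, visual_dim), e.g. flattened mouth landmarks
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        outputs, _ = self.rnn(fused)
        # Average over time, then score the whole clip.
        clip_repr = outputs.mean(dim=1)
        return self.classifier(clip_repr).squeeze(-1)


if __name__ == "__main__":
    model = TamperDetector()
    audio = torch.randn(2, 100, 40)   # 2 clips, 100 frames, 40-dim audio features
    visual = torch.randn(2, 100, 40)  # matching 40-dim visual (landmark) features
    scores = model(audio, visual)     # raw logits; apply sigmoid for probabilities
    print(scores.shape)               # torch.Size([2])
```

The clip-level score can then be thresholded at the operating point that equalizes false acceptance and false rejection rates, which is how the equal error rate quoted above would be measured.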
