Audio Recapture Detection With Convolutional Neural Networks

In this paper, we investigate how deep neural networks can effectively learn features for audio forensic problems. After a preliminary feature preprocessing step based on electric network frequency (ENF) analysis, we propose a convolutional neural network (CNN) for training on and classifying genuine and recaptured audio recordings. The network learns hierarchical representations that capture the ENF components at multiple levels of detail, and these representations can be used for further classification. The proposed method works on audio clips as short as 2 seconds, whereas state-of-the-art methods may fail on clips this short. Experimental results demonstrate that the proposed network yields high detection accuracy when each ENF harmonic component is represented as a single-channel input. Performance can be further improved by a combined input representation that incorporates both the fundamental ENF and its harmonics. We also study the convergence of the network and the effect of the analysis window size. A performance comparison against the support tensor machine demonstrates the advantage of using a CNN for audio recapture detection. Moreover, visualization of the intermediate feature maps provides some insight into what the deep neural networks actually learn and how they make decisions.
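As a rough illustration of what a per-harmonic input representation might look like, the sketch below measures the energy near the ENF fundamental and its harmonics over short frames of a 2-second clip, giving one row per harmonic that could serve as a single channel or be stacked into a combined input. The abstract does not describe the actual preprocessing, so everything here is an assumption made for illustration: the Goertzel recurrence as the frequency estimator, the 50 Hz nominal frequency (60 Hz in some regions), the frame and hop sizes, and the hypothetical function names `goertzel_power` and `enf_harmonic_channels`.

```python
import math


def goertzel_power(frame, sample_rate, freq):
    """Squared magnitude of `frame` at `freq` (Hz) via the Goertzel recurrence."""
    w = 2.0 * math.pi * freq / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def enf_harmonic_channels(signal, sample_rate, nominal_hz=50.0,
                          n_harmonics=3, frame_len=200, hop=100):
    """Return one row per harmonic; each row holds per-frame power.

    Assumed layout (not taken from the paper): row k-1 tracks the k-th
    multiple of the nominal ENF, so row 0 is the fundamental.
    """
    starts = range(0, len(signal) - frame_len + 1, hop)
    channels = []
    for k in range(1, n_harmonics + 1):
        row = [goertzel_power(signal[i:i + frame_len], sample_rate, k * nominal_hz)
               for i in starts]
        channels.append(row)
    return channels


# Synthetic 2-second clip at 1 kHz: a strong 50 Hz hum plus a weak 100 Hz harmonic.
sample_rate = 1000
clip = [math.sin(2 * math.pi * 50 * n / sample_rate)
        + 0.1 * math.sin(2 * math.pi * 100 * n / sample_rate)
        for n in range(2 * sample_rate)]
channels = enf_harmonic_channels(clip, sample_rate)
print(len(channels), "channels x", len(channels[0]), "frames")
```

Each row can be fed to a network on its own, or the rows can be stacked as channels of one input, which loosely mirrors the single-channel and combined representations mentioned above.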
