Human Perception of Audio Deepfakes

The recent emergence of deepfakes has brought manipulated and generated content to the forefront of machine learning research. While automatic deepfake detection has seen many new machine learning techniques, human detection capabilities are far less explored. In this paper, we present results from comparing the abilities of humans and machines in detecting audio deepfakes used to imitate someone's voice. For this, we use a web-based application framework formulated as a game. Participants were asked to distinguish between real and fake audio samples. In our experiment, 410 unique users competed against a state-of-the-art AI deepfake detection algorithm over 13,229 rounds of the game. We find that humans and deepfake detection algorithms share similar strengths and weaknesses, both struggling to detect certain types of attacks. This stands in contrast to the superhuman performance of AI in many other application areas, such as object detection or face recognition. Concerning human success factors, we find that IT professionals have no advantage over non-professionals, but native speakers have an advantage over non-native speakers. Additionally, we find that older participants tend to be more susceptible than younger ones. These insights may be helpful when designing future cybersecurity training for humans as well as when developing better detection algorithms.
