Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay Attacks

Fake audio detection is expected to become an important research area for smart speakers such as Google Home and Amazon Echo, and for the chatbots developed for these platforms. This paper demonstrates the vulnerability of voice-driven interfaces to replay attacks and proposes a countermeasure to detect such attacks on these platforms. It presents a novel framework that models replay attack distortion and then uses a non-learning-based method for replay attack detection on smart speakers. The replay attack distortion is modeled as a higher-order nonlinearity in the replayed audio. Higher-order spectral analysis (HOSA) is used to capture the characteristic distortions in the replayed audio. The effectiveness of the proposed countermeasure is evaluated on original speech as well as the corresponding replayed recordings. The replay attack recordings are successfully injected into the Google Home device via Amazon Alexa using the drop-in conferencing feature.
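
To make the HOSA idea concrete, the following is a minimal sketch of a frame-averaged bicoherence estimate, a standard higher-order spectral statistic that stays near zero for linear Gaussian signals and grows when a nonlinearity couples frequency components. It is not the detector proposed in the paper: the function name `bicoherence`, the parameters `nfft` and `hop`, and the memoryless quadratic term used as a stand-in for replay-chain distortion are all assumptions made for illustration.

```python
import numpy as np


def bicoherence(x, nfft=64, hop=32):
    """Frame-averaged squared bicoherence on the grid 1 <= f1, f2 < nfft // 4.

    Hypothetical helper for illustration; values lie in [0, 1].
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(nfft)

    # Split the signal into overlapping frames and remove each frame's mean.
    starts = range(0, len(x) - nfft + 1, hop)
    frames = np.stack([x[s:s + nfft] for s in starts])
    frames = frames - frames.mean(axis=1, keepdims=True)
    spec = np.fft.fft(frames * window, axis=1)

    # Skip the DC bin and keep f1 + f2 below the Nyquist bin.
    idx = np.arange(1, nfft // 4)
    f_sum = idx[:, None] + idx[None, :]

    pair = spec[:, idx][:, :, None] * spec[:, idx][:, None, :]  # X(f1) X(f2)
    target = spec[:, f_sum]                                     # X(f1 + f2)

    # Squared magnitude of the averaged triple product, normalized by power.
    num = np.abs(np.mean(pair * np.conj(target), axis=0)) ** 2
    den = np.mean(np.abs(pair) ** 2, axis=0) * np.mean(np.abs(target) ** 2, axis=0)
    return num / (den + 1e-12)


# Toy comparison: Gaussian noise versus the same noise passed through an
# assumed memoryless quadratic distortion (a stand-in, not the paper's model).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16384)
distorted = clean + 0.5 * clean ** 2

print("mean bicoherence, clean:    ", bicoherence(clean).mean())
print("mean bicoherence, distorted:", bicoherence(distorted).mean())
```

In this toy setting the distorted signal should show a clearly higher mean bicoherence than the clean one. A non-learning detector in the spirit of the abstract could threshold such a statistic, although the features and decision rule the paper actually uses are not stated in this abstract.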
