"Hello? Who Am I Talking to?" A Shallow CNN Approach for Human vs. Bot Speech Classification

Automatic speech generation algorithms, enhanced by deep learning techniques, enable increasingly seamless and immediate machine-to-human interaction. As a result, the latest generation of phone-calling bots sounds more convincingly human than previous generations. The application of this technology has a strong social impact in terms of privacy issues (e.g., in customer-care services), fraudulent actions (e.g., social hacking), and erosion of trust (e.g., generation of fake conversations). For these reasons, it is crucial to identify the nature of a speaker as either a human or a bot. In this paper, we propose a speech classification algorithm based on Convolutional Neural Networks (CNNs), which enables the automatic classification of human vs. non-human speakers from the analysis of short audio excerpts. We evaluate the effectiveness of the proposed solution by exploiting a real human speech database populated with audio recordings from various sources, together with speech automatically generated by state-of-the-art text-to-speech systems based on deep learning (e.g., Google WaveNet).
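To make the classification pipeline concrete, the sketch below shows the forward pass of a shallow CNN over a spectrogram-like input, ending in a single logistic output (probability that the speaker is a bot). This is a minimal NumPy illustration of the general technique, not the paper's exact architecture: the input shape (64 mel bands × 100 frames), the number and size of kernels, and the single conv layer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels, stride=1):
    """Valid 2-D convolution: x is (H, W), kernels is (K, kh, kw)."""
    K, kh, kw = kernels.shape
    H, W = x.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((K, oh, ow))
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out

def shallow_cnn_forward(spectrogram, kernels, w, b):
    """One conv layer -> ReLU -> global average pooling -> logistic output."""
    feat = np.maximum(conv2d(spectrogram, kernels), 0.0)  # ReLU activations
    pooled = feat.mean(axis=(1, 2))                       # one value per kernel
    logit = pooled @ w + b                                # linear read-out
    return 1.0 / (1.0 + np.exp(-logit))                   # P(speaker is a bot)

# Illustrative shapes: a 64-band x 100-frame spectrogram, 8 random 3x3 kernels.
spec = rng.standard_normal((64, 100))
kernels = rng.standard_normal((8, 3, 3))
w = rng.standard_normal(8)
b = 0.0
p_bot = shallow_cnn_forward(spec, kernels, w, b)
print(f"P(bot) = {p_bot:.3f}")
```

In practice the weights would be learned by gradient descent on labeled human/bot excerpts, and the spectrogram would be computed from a short audio excerpt rather than drawn at random; only the overall conv → pool → classify structure is meant to carry over.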
