"Hello? Who Am I Talking to?" A Shallow CNN Approach for Human vs. Bot Speech Classification

Automatic speech generation algorithms, enhanced by deep learning techniques, enable increasingly seamless and immediate machine-to-human interaction. As a result, the latest generation of phone-calling bots sounds more convincingly human than previous generations. The application of this technology has a strong social impact in terms of privacy issues (e.g., in customer-care services), fraudulent actions (e.g., social hacking), and erosion of trust (e.g., generation of fake conversations). For these reasons, it is crucial to identify the nature of a speaker as either a human or a bot. In this paper, we propose a speech classification algorithm based on Convolutional Neural Networks (CNNs), which enables the automatic classification of human vs. non-human speakers from the analysis of short audio excerpts. We evaluate the effectiveness of the proposed solution by exploiting a real human speech database populated with audio recordings from various sources, together with speech automatically generated by state-of-the-art text-to-speech systems based on deep learning (e.g., Google WaveNet).
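To make the classification pipeline concrete, the sketch below shows the forward pass of a shallow CNN over a spectrogram-like input, ending in a single logistic output (probability that the speaker is a bot). This is a minimal NumPy illustration of the general technique, not the paper's exact architecture: the input shape (64 mel bands × 100 frames), the number and size of kernels, and the single conv layer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels, stride=1):
    """Valid 2-D convolution: x is (H, W), kernels is (K, kh, kw)."""
    K, kh, kw = kernels.shape
    H, W = x.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((K, oh, ow))
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out

def shallow_cnn_forward(spectrogram, kernels, w, b):
    """One conv layer -> ReLU -> global average pooling -> logistic output."""
    feat = np.maximum(conv2d(spectrogram, kernels), 0.0)  # ReLU activations
    pooled = feat.mean(axis=(1, 2))                       # one value per kernel
    logit = pooled @ w + b                                # linear read-out
    return 1.0 / (1.0 + np.exp(-logit))                   # P(speaker is a bot)

# Illustrative shapes: a 64-band x 100-frame spectrogram, 8 random 3x3 kernels.
spec = rng.standard_normal((64, 100))
kernels = rng.standard_normal((8, 3, 3))
w = rng.standard_normal(8)
b = 0.0
p_bot = shallow_cnn_forward(spec, kernels, w, b)
print(f"P(bot) = {p_bot:.3f}")
```

In practice the weights would be learned by gradient descent on labeled human/bot excerpts, and the spectrogram would be computed from a short audio excerpt rather than drawn at random; only the overall conv → pool → classify structure is meant to carry over.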
