Discriminate Natural versus Loudspeaker Emitted Speech

In this work, we address a novel but potentially emerging problem: discriminating between natural human voices and voices played back through audio devices, in the context of interactions with in-home voice user interfaces. The problem has relevant applications in (1) far-field voice interaction with vocal interfaces such as Amazon Echo, Google Home, and Facebook Portal, and (2) replay spoofing attack detection. Detecting loudspeaker-emitted speech helps avoid false wake-ups or unintended interactions with the device in the first application, and rejects attacks that replay recordings collected from enrolled speakers in the second. We first collect a real-world dataset under well-controlled conditions containing two classes: speech recordings spoken directly by numerous people (natural speech) and recordings played back from various loudspeakers (loudspeaker-emitted speech). From this dataset, we then build prediction models based on Deep Neural Networks (DNNs), for which different combinations of audio features are considered. Experimental results confirm the feasibility of the task: the combination of audio embeddings extracted from the SoundNet and VGGish networks yields a classification accuracy of up to about 90%.
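The abstract does not spell out the classifier architecture, so the following is only a minimal sketch of the general recipe it describes: extract clip-level SoundNet and VGGish embeddings, concatenate them (early fusion), and train a small feed-forward DNN for the binary natural vs. loudspeaker-emitted decision. The file names, feature pooling, layer sizes, and training settings below are illustrative assumptions, not the authors' exact configuration; SoundNet conv7 activations (1024-D) and VGGish embeddings (128-D) are used here as plausible clip-level features.

```python
# Hypothetical sketch: fuse precomputed SoundNet and VGGish clip embeddings
# and train a small DNN to separate natural vs. loudspeaker-emitted speech.
# Layer sizes and training settings are illustrative, not the paper's setup.
import numpy as np
import tensorflow as tf

# Assume embeddings were extracted offline and averaged over time per clip
# (hypothetical file names):
#   soundnet_train: (n_clips, 1024)  -- e.g. SoundNet conv7 activations
#   vggish_train:   (n_clips, 128)   -- VGGish 128-D embeddings
#   y_train:        (n_clips,)       -- 1 = natural, 0 = loudspeaker-emitted
soundnet_train = np.load("soundnet_train.npy")
vggish_train = np.load("vggish_train.npy")
y_train = np.load("labels_train.npy")

# Early fusion: concatenate the two embeddings into one feature vector.
x_train = np.concatenate([soundnet_train, vggish_train], axis=1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(x_train.shape[1],)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary decision
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1)
```

Feature-level (early) fusion is only one way to combine the two embeddings; score-level fusion of two separately trained classifiers would be an equally plausible reading of the abstract.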

[1] Tanja Schultz et al., "Tue-SeA Real-Time Speech Command Detector for a Smart Control Room," INTERSPEECH, 2011.

[2] Aren Jansen et al., "CNN Architectures for Large-Scale Audio Classification," IEEE ICASSP, 2017.

[3] Nicholas W. D. Evans et al., "A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients," Odyssey, 2016.

[4] Sri Harish Reddy Mallidi et al., "Device-Directed Utterance Detection," INTERSPEECH, 2018.

[5] Ankit Shah et al., "DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System," DCASE, 2017.

[6] Nicholas W. D. Evans et al., "Re-assessing the Threat of Replay Spoofing Attacks Against Automatic Speaker Verification," International Conference of the Biometrics Special Interest Group (BIOSIG), 2014.

[7] Dilek Z. Hakkani-Tür et al., "Learning When to Listen: Detecting System-Addressed Speech in Human-Human-Computer Dialog," INTERSPEECH, 2012.

[8] Kong-Aik Lee et al., "ASVspoof 2017 Version 2.0: Meta-Data Analysis and Baseline Enhancements," Odyssey, 2018.

[9] Bin Ma et al., "The RedDots Data Collection for Speaker Recognition," INTERSPEECH, 2015.

[10] Andreas Stolcke et al., "Using Out-of-Domain Data for Lexical Addressee Detection in Human-Human-Computer Dialog," HLT-NAACL, 2013.

[11] Aren Jansen et al., "Audio Set: An Ontology and Human-Labeled Dataset for Audio Events," IEEE ICASSP, 2017.

[12] Haizhou Li et al., "A Study on Replay Attack and Anti-Spoofing for Text-Dependent Speaker Verification," APSIPA Annual Summit and Conference, 2014.

[13] Birger Kollmeier et al., "On the Use of Spectro-Temporal Features for the IEEE AASP Challenge 'Detection and Classification of Acoustic Scenes and Events'," IEEE WASPAA, 2013.

[14] Kong-Aik Lee et al., "The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection," INTERSPEECH, 2017.

[15] Karol J. Piczak, "ESC: Dataset for Environmental Sound Classification," ACM Multimedia, 2015.

[16] Gökhan Tür et al., "Understanding Computer-Directed Utterances in Multi-User Dialog Systems," IEEE ICASSP, 2013.

[17] Rafal Samborski et al., "Playback Attack Detection for Text-Dependent Speaker Verification over Telephone Channels," Speech Communication, 2015.

[18] Tetsuya Takiguchi et al., "System Request Detection in Human Conversation Based on Multi-Resolution Gabor Wavelet Features," INTERSPEECH, 2009.

[19] Antonio Torralba et al., "SoundNet: Learning Sound Representations from Unlabeled Video," NIPS, 2016.