Analysis of the robustness of neural network-based target activity detection

Many applications in audio signal processing require precise identification of the time frames in which a predefined target source is active. In previous work, Artificial Neural Networks (ANNs) with cross-correlation features showed considerable potential in this field. In this paper, the performance of ANN-based target activity detection is analyzed in more detail and compared with a well-performing "classical" signal processing method. First, the impact of the angular distance between the target source and the interferers is evaluated for both the neural network-based method and the classical one. Second, the sensitivity of both methods to varying Signal-to-Noise Ratio (SNR) conditions is analyzed, with a focus on how much each method depends on a proper choice of detection thresholds. In the evaluations, the ANN-based method generally outperforms the classical method and also proves robust to a non-ideal choice of detection thresholds.
