Exploiting Periodicity Features for Joint Detection and DOA Estimation of Speech Sources Using Convolutional Neural Networks

While many algorithms treat direction of arrival (DOA) estimation and voice activity detection (VAD) as two separate tasks, only a few data-driven methods address them jointly. This paper proposes a multi-input single-output convolutional neural network (CNN) that exploits a novel feature combination for joint DOA estimation and VAD in the context of binaural hearing aids. In addition to the well-known generalized cross-correlation with phase transform (GCC-PHAT) feature, the network uses an auditory-inspired feature called periodicity degree (PD), which provides a broadband representation of the periodic structure of the signal. The proposed CNN is trained with a multi-condition training scheme across different signal-to-noise ratios. Experimental results for a single-talker scenario in reverberant environments show that, by exploiting the PD feature, the proposed CNN is able to distinguish speech from non-speech signal blocks and thereby outperforms the baseline CNN in terms of DOA estimation accuracy. The results also show that the proposed method is able to adapt to unseen acoustic conditions and background noises.
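
For readers unfamiliar with the GCC-PHAT feature mentioned in the abstract, the following is a minimal sketch of the textbook computation for one pair of signal blocks, assuming NumPy. It is not the authors' feature extraction: the function name `gcc_phat`, the `max_lag` parameter, the zero-padding length, and the small constant guarding the division are illustrative assumptions. The PD feature and the CNN itself are not shown.

```python
import numpy as np


def gcc_phat(left, right, max_lag):
    """Textbook GCC-PHAT between two equally sampled signal blocks.

    Returns (lags, cc), where cc[i] is the phase-transform-weighted
    cross-correlation at a lag of lags[i] samples. With this sign
    convention, a negative peak lag means `right` is a delayed copy
    of `left`. All parameter names here are illustrative assumptions.
    """
    # Zero-pad so the circular correlation equals the linear one.
    n = len(left) + len(right)
    spec_left = np.fft.rfft(left, n=n)
    spec_right = np.fft.rfft(right, n=n)

    # Cross-power spectrum of the two channels.
    cross = spec_left * np.conj(spec_right)

    # PHAT weighting: discard magnitude, keep only phase.
    # The small floor avoids division by zero in empty bins.
    cross /= np.maximum(np.abs(cross), 1e-12)

    # Back to the lag domain; negative lags sit at the end of the array.
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc
```

In a binaural setting, `left` and `right` would be time-aligned blocks from the two hearing-aid microphones, and `max_lag` would be chosen to cover the largest physically plausible interaural delay at the given sampling rate. The resulting vector `cc` is the kind of input a network could consume in place of a hand-picked peak.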
