Sound Localization Based on Phase Difference Enhancement Using Deep Neural Networks

The performance of most classical sound source localization algorithms degrades severely in the presence of background noise or reverberation. Recently, deep neural networks (DNNs) have been applied successfully to sound source localization, mainly by classifying the direction-of-arrival (DoA) into one of several candidate sectors. In this paper, we propose a DNN-based phase difference enhancement for DoA estimation, which proved more accurate than directly estimating the DoAs from the input interchannel phase differences (IPDs). The sinusoidal functions of the phase differences for "clean and dry" source signals are estimated from the sinusoidal functions of the IPDs of the input signals, which may include directional signals, diffuse noise, and reverberation. The resulting DoA is further refined to compensate for the estimation bias near the end-fire directions. From the enhanced IPDs, we determine a DoA for each frequency bin and then derive the DoAs for the current frame from the distribution of the per-frequency DoAs. Experimental results with various types and levels of background noise, reverberation times, numbers of sources, room impulse responses, and DoAs showed that the proposed method outperformed conventional approaches.
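To make the pipeline concrete, the sketch below illustrates the classical (unenhanced) front end the paper builds on: computing the wrapped IPD per frequency bin from a two-microphone frame, inverting the far-field plane-wave model to get one DoA candidate per bin, and then picking the frame-level DoA as the mode of the per-bin distribution. This is a minimal free-field sketch under assumed parameters (16 kHz sampling, 512-point FFT, 8 cm spacing); the proposed method would replace the raw IPDs with DNN-enhanced ones before this inversion, and the function and parameter names here are illustrative, not from the paper.

```python
import numpy as np

def doa_per_bin(x_left, x_right, fs=16000, n_fft=512, d=0.08, c=343.0):
    """One DoA candidate per frequency bin from the interchannel phase
    difference (IPD) of a two-microphone pair, assuming a far-field
    plane wave in a free field (no enhancement applied)."""
    win = np.hanning(n_fft)
    X_l = np.fft.rfft(x_left[:n_fft] * win)
    X_r = np.fft.rfft(x_right[:n_fft] * win)
    ipd = np.angle(X_l * np.conj(X_r))            # wrapped IPD per bin
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # Far-field model: ipd = 2*pi*f*d*sin(theta)/c  ->  invert per bin
    s = c * ipd / (2.0 * np.pi * np.maximum(freqs, 1e-12) * d)
    s = np.clip(s, -1.0, 1.0)                     # guard against noise/aliasing
    return np.degrees(np.arcsin(s))               # DoA estimate per bin

def frame_doa(theta_bins, n_sectors=37):
    """Frame-level DoA as the histogram peak of the per-bin estimates
    (the DC bin is skipped, since its IPD carries no delay information)."""
    hist, edges = np.histogram(theta_bins[1:], bins=n_sectors, range=(-90, 90))
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])        # center of the peak sector
```

Note the clipping step: above the spatial-aliasing frequency c/(2d) the wrapped IPD maps to the wrong angle, which is one reason aggregating over the per-frequency DoA distribution, as the abstract describes, is more robust than trusting any single bin.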
