A Regression Approach to Speech Source Localization Exploiting Deep Neural Network

This paper presents a data-driven framework for speech source localization (SSL) using a deep neural network (DNN), which directly constructs a nonlinear regression mapping from an extracted feature to the direction-of-arrival (DOA) of an indoor speech source. The proposed method comprises a feature-extractor front-end and a regression-network back-end. First, since the DOA information contained in the steering vector of the speech source is captured by the eigenvector spanning the signal subspace, that eigenvector is extracted as the input feature via eigenanalysis. Second, a regression DNN is adopted to model the nonlinear relationship between the eigenvector and the source direction, with a time delay neural network (TDNN) chosen as the basic network architecture. Experiments conducted in both simulated and real environments using an eight-channel circular array demonstrate the superiority and potential of the proposed method for SSL.
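The front-end idea above can be sketched in code: estimate the spatial covariance matrix from multichannel snapshots, eigendecompose it, and take the dominant eigenvector as the signal-subspace feature fed to the regression network. This is a minimal illustration, not the paper's exact pipeline — the choice of frequency bins, the phase normalization, and the real/imaginary stacking are assumptions made here for concreteness.

```python
import numpy as np

def subspace_feature(X):
    """Extract a signal-subspace feature from narrowband array snapshots.

    X : complex array of shape (channels, frames), STFT snapshots at one
        frequency bin. For a single source, the principal eigenvector of
        the spatial covariance matrix spans the signal subspace and thus
        encodes the steering-vector (DOA) information.
    Returns a real-valued vector of length 2 * channels, suitable as a
    DNN input (real and imaginary parts stacked -- an assumed encoding).
    """
    # Sample spatial covariance estimate
    R = X @ X.conj().T / X.shape[1]
    # Hermitian eigendecomposition; eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    u = eigvecs[:, -1]                      # dominant eigenvector
    u = u / u[np.argmax(np.abs(u))]         # remove arbitrary global phase
    return np.concatenate([u.real, u.imag])
```

For a noisy rank-one snapshot model, the returned vector aligns (up to phase) with the true steering vector, which is what makes it a usable DOA feature for the downstream regression network.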
