A Deep Learning Framework for Robust DOA Estimation Using Spherical Harmonic Decomposition

Spherical harmonic decomposition separates the sound pressure at different microphones into independent functions of frequency, azimuth, and elevation of the source and microphone locations. This separation allows two sets of features to be extracted, each carrying different information about the elevation and azimuth of the source, for direction of arrival (DOA) estimation. These features can be fed to a learning approach that estimates azimuth and elevation separately, breaking the DOA estimation problem into two smaller ones. One advantage is lower computational complexity than joint DOA estimation, which in turn makes the approach straightforward to extend to denser DOA search grids. The contribution of this paper is threefold. First, we propose spherical harmonic magnitude and phase features and discuss the information they carry about the azimuth and elevation of the source. Second, we propose convolutional neural network architectures for DOA estimation. Third, we analyse the training and run-time computational complexities and propose extending the DOA estimation approach to a dense DOA search grid rather than restricting it to a sparse one. The performance of conventional DOA estimation approaches degrades in noisy and reverberant environments. Several advancements to existing DOA estimation approaches have recently been proposed. However, to the best of the authors’ knowledge, learning approaches to DOA estimation that use dense DOA search grids and few frames have not been proposed for spherical arrays. Performance is evaluated on simulated as well as real datasets. The proposed approach is also evaluated on the LOCATA dataset for a moving source. The results are encouraging enough to warrant considering the proposed method for practical scenarios.
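The separation of azimuth and elevation described above can be illustrated with the spherical harmonic basis functions themselves. The following is a minimal sketch, not the feature extraction used in the paper: it assumes the standard complex spherical harmonic Y_n^m with elevation measured from the z-axis, and the helper names (assoc_legendre, sph_harm_nm, grid_sizes) and the 5 degree grid are hypothetical choices made for illustration. It shows that the magnitude of Y_n^m does not change with azimuth, that its phase does not change with elevation (as long as the associated Legendre term keeps its sign), and how the number of candidate outputs differs between a joint grid and two separate grids.

```python
import cmath
import math


def assoc_legendre(n, m, x):
    """Associated Legendre function P_n^m(x) for 0 <= m <= n.

    Uses the standard upward recurrence and includes the
    Condon-Shortley phase (-1)^m.
    """
    pmm = 1.0
    if m > 0:
        sin_term = math.sqrt((1.0 - x) * (1.0 + x))
        odd = 1.0
        for _ in range(m):
            pmm *= -odd * sin_term
            odd += 2.0
    if n == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    if n == m + 1:
        return pmmp1
    for ll in range(m + 2, n + 1):
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1


def sph_harm_nm(n, m, elevation, azimuth):
    """Complex spherical harmonic Y_n^m for m >= 0 (angles in radians).

    Assumption: elevation is measured from the positive z-axis.
    """
    norm = math.sqrt(
        (2 * n + 1) / (4 * math.pi)
        * math.factorial(n - m) / math.factorial(n + m)
    )
    legendre = assoc_legendre(n, m, math.cos(elevation))
    return norm * legendre * cmath.exp(1j * m * azimuth)


def grid_sizes(step_deg):
    """Count candidate directions on a hypothetical uniform grid.

    Returns (n_azimuth, n_elevation, joint, separate), where joint is the
    number of (azimuth, elevation) pairs and separate is the total number
    of outputs when the two angles are estimated independently.
    """
    n_az = 360 // step_deg          # 0, step, ..., 360 - step
    n_el = 180 // step_deg + 1      # 0, step, ..., 180
    return n_az, n_el, n_az * n_el, n_az + n_el


if __name__ == "__main__":
    el = math.radians(40)
    # Same elevation, different azimuths: the magnitude is unchanged.
    print(abs(sph_harm_nm(2, 1, el, 0.5)), abs(sph_harm_nm(2, 1, el, 2.0)))
    # Same azimuth, different elevations: the phase is unchanged.
    print(cmath.phase(sph_harm_nm(2, 1, el, 0.5)),
          cmath.phase(sph_harm_nm(2, 1, math.radians(70), 0.5)))
    print(grid_sizes(5))
```

The output count is only one contributor to computational cost, so this comparison should be read as a rough indication of why separate estimation scales more gently with grid density, not as the complexity analysis carried out in the paper.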
