Exploiting CNNs for Improving Acoustic Source Localization in Noisy and Reverberant Conditions

This paper discusses the application of convolutional neural networks (CNNs) to minimum variance distortionless response (MVDR) localization schemes. We investigate the direction of arrival (DOA) estimation problem in noisy and reverberant conditions using a uniform linear array (ULA). CNNs are used to process the multichannel data from the ULA and to improve the data fusion scheme performed in the steered response power (SRP) computation. Specifically, CNNs improve the incoherent frequency fusion of the narrowband response power by weighting its components, which reduces the deleterious effect of components corrupted by noise and reverberation artifacts. Using CNNs removes the need to first encode the multichannel data into selected acoustic cues, and it exploits their ability to recognize geometrical pattern similarity. Experiments with both simulated and real acoustic data demonstrate the superior localization performance of the proposed SRP beamformer with respect to other state-of-the-art techniques.
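
The weighted incoherent frequency fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the diagonal loading, the per-band normalization, and the far-field plane-wave model are all assumptions made for illustration. In the paper the per-frequency weights come from a CNN; here `weights` is simply a placeholder array supplied by the caller.

```python
import numpy as np

def steering_vector(freq_hz, theta_rad, n_mics, spacing_m, c=343.0):
    """Far-field ULA steering vector for one frequency and one direction."""
    delays = np.arange(n_mics) * spacing_m * np.cos(theta_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_power(R, a, loading=1e-3):
    """Narrowband MVDR response power 1 / (a^H R^-1 a).

    Diagonal loading (an assumption of this sketch) keeps R invertible.
    """
    n = R.shape[0]
    R_loaded = R + loading * np.real(np.trace(R)) / n * np.eye(n)
    return 1.0 / np.real(a.conj() @ np.linalg.solve(R_loaded, a))

def weighted_srp(covs, freqs_hz, thetas_rad, weights, spacing_m):
    """Incoherently fuse narrowband MVDR power maps with per-bin weights.

    covs    : (n_freqs, n_mics, n_mics) spatial covariance per frequency bin
    weights : (n_freqs,) placeholder for the weights a trained model would give
    """
    n_mics = covs.shape[1]
    srp = np.zeros(len(thetas_rad))
    for R, f, w in zip(covs, freqs_hz, weights):
        p = np.array([
            mvdr_power(R, steering_vector(f, t, n_mics, spacing_m))
            for t in thetas_rad
        ])
        # Normalize each narrowband map so no single bin dominates the sum
        # (assumed normalization), then apply that bin's weight.
        srp += w * p / p.max()
    return srp

# Toy usage: one noiseless plane wave from 60 degrees plus white sensor noise.
thetas = np.deg2rad(np.arange(0, 181, 5))
freqs = np.array([500.0, 1000.0, 2000.0, 3000.0])
n_mics, spacing = 8, 0.04  # 4 cm spacing avoids spatial aliasing up to 3 kHz
true_theta = np.deg2rad(60.0)
covs = np.stack([
    np.outer(a, a.conj()) + 0.1 * np.eye(n_mics)
    for a in (steering_vector(f, true_theta, n_mics, spacing) for f in freqs)
])
weights = np.ones(len(freqs))  # uniform weights stand in for learned ones
srp = weighted_srp(covs, freqs, thetas, weights, spacing)
doa_deg = float(np.rad2deg(thetas[np.argmax(srp)]))
print(doa_deg)
```

Setting a bin's weight to zero drops that narrowband map from the fused response, which is the mechanism by which unreliable components can be suppressed.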
