Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments

Joint sound event localization and detection (SELD) is an integral part of building context awareness into the communication interfaces of mobile robots, smartphones, and home assistants. For example, automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and estimation of their direction. Existing SELD approaches have been evaluated either on material recorded in controlled indoor environments or on audio simulated by mixing isolated sounds at different spatial locations. This paper studies SELD of speech in diverse everyday environments, where the audio corresponds to typical usage scenarios of handheld mobile devices. To allow weighting the relative importance of localization versus detection, we propose a two-stage hierarchical system in which the first stage detects the target events and the second stage localizes them. The proposed method uses a convolutional recurrent neural network (CRNN) and is evaluated on a database of manually annotated microphone array recordings from various acoustic conditions. The array is embedded in a contemporary mobile phone form factor. The results show that the proposed method achieves good speech detection and localization accuracy compared with a non-hierarchical flat classification model.
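The two-stage idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `hierarchical_decode` and `hierarchical_loss`, the 0.5 threshold, the discrete direction classes, and the convex weighting of the two losses are all assumptions made for illustration. The sketch assumes a network has already produced, for each frame, a speech probability and a score for each candidate direction.

```python
from typing import List, Optional


def hierarchical_decode(p_speech: List[float],
                        p_direction: List[List[float]],
                        threshold: float = 0.5) -> List[Optional[int]]:
    """Stage 1 gates stage 2: a direction is reported only when speech is detected.

    p_speech[t]     -- hypothetical per-frame speech probability (stage 1 output)
    p_direction[t]  -- hypothetical per-frame scores over direction classes (stage 2 output)
    Returns the index of the best direction class per frame, or None if no speech.
    """
    decoded: List[Optional[int]] = []
    for p, scores in zip(p_speech, p_direction):
        if p >= threshold:
            # Stage 2: pick the highest-scoring direction class.
            decoded.append(max(range(len(scores)), key=scores.__getitem__))
        else:
            # Stage 1 rejected the frame, so no direction is reported.
            decoded.append(None)
    return decoded


def hierarchical_loss(detection_loss: float,
                      localization_loss: float,
                      weight: float = 0.5) -> float:
    """Assumed convex combination expressing the relative importance of the two tasks."""
    return weight * detection_loss + (1.0 - weight) * localization_loss


if __name__ == "__main__":
    frames = hierarchical_decode([0.9, 0.2], [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])
    print(frames)
    print(hierarchical_loss(2.0, 4.0, weight=0.25))
```

In this sketch, raising `weight` emphasizes detection and lowering it emphasizes localization; how the paper actually realizes the weighting is not stated in the abstract.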
