A Real-Time Speech Separation Method Based on Camera and Microphone Array Sensors Fusion Approach

In the context of assisted hearing, identifying and enhancing non-stationary target speech in noisy environments such as a cocktail party is an important problem for real-time speech separation. Previous studies mostly relied on microphone signal processing for target speech separation and analysis, for example feature recognition through supervised machine learning on large amounts of training data. Such methods are suitable for stationary noise suppression, but are relatively limited against non-stationary noise and struggle to meet real-time processing requirements. In this study, we propose a real-time speech separation method based on an approach that combines an optical camera and a microphone array. The method is divided into two stages. Stage 1 uses computer vision with the camera to detect and identify targets of interest and to estimate the source angle and distance. Stage 2 uses beamforming with the microphone array to enhance and separate the target speech. An asynchronous update function integrates the beamforming control with the speech processing to reduce the effect of processing delay. The experimental results show noise reductions of 6.1 dB and 5.2 dB in stationary and non-stationary noise environments, respectively. The response time of the speech processing was less than 10 ms, which meets the requirements of a real-time system. The proposed method has high potential for application in assistive listening systems and in machine language processing such as intelligent personal assistants.
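The two-stage idea above — the camera estimates where the speaker is, and the microphone array then steers toward that direction — can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a simple pinhole model to turn a detected pixel position into an angle, and a basic delay-and-sum beamformer as a stand-in for the beamforming stage; all function names and parameter values are illustrative assumptions.

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def pixel_to_angle(x_px, cx, fx):
    """Stage 1 (sketch): horizontal angle of a detected target from its
    image column, via a pinhole camera model.  cx = principal point (px),
    fx = focal length (px) — both assumed known from camera calibration."""
    return np.degrees(np.arctan((x_px - cx) / fx))

def delay_and_sum(signals, mic_pos, angle_deg, fs):
    """Stage 2 (sketch): delay-and-sum beamformer for a linear array.
    signals  : (n_mics, n_samples) microphone recordings
    mic_pos  : mic coordinates along the array axis, metres
    angle_deg: steering angle from broadside (camera estimate)."""
    n_mics, n = signals.shape
    # Far-field propagation delay of each mic for the steering direction.
    delays = mic_pos * np.sin(np.deg2rad(angle_deg)) / C
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for m in range(n_mics):
        spec = np.fft.rfft(signals[m])
        # Advance each channel by its delay so all copies align in phase.
        spec *= np.exp(2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spec, n=n)
    return out / n_mics

# --- tiny simulation: a 2 kHz tone arriving from 30 degrees ---------------
fs, f0, true_angle = 16000, 2000.0, 30.0
t = np.arange(1600) / fs                            # 0.1 s, whole cycles of f0
mic_pos = np.array([-0.075, -0.025, 0.025, 0.075])  # 4 mics, 5 cm pitch
tau = mic_pos * np.sin(np.deg2rad(true_angle)) / C
signals = np.sin(2 * np.pi * f0 * (t[None, :] - tau[:, None]))

# Camera "detects" the speaker at an image column consistent with 30 deg.
steer = pixel_to_angle(320 + 600 * np.tan(np.deg2rad(30.0)), cx=320, fx=600)

aligned_rms = np.sqrt(np.mean(delay_and_sum(signals, mic_pos, steer, fs) ** 2))
mismatched_rms = np.sqrt(np.mean(delay_and_sum(signals, mic_pos, -60.0, fs) ** 2))
# Steering at the camera-estimated angle preserves the tone, while a wrong
# steering angle attenuates it — the basic mechanism of spatial separation.
```

Steering toward the camera-estimated direction keeps the target at full level, while sources off that direction are attenuated by the array pattern; the paper's asynchronous update would refresh the steering angle as the vision stage tracks the target.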
