Image Processing Techniques for Segments Grouping in Monaural Speech Separation

Monaural speech separation is the process of separating the target speech from a noisy speech mixture recorded with a single microphone. It is a challenging problem in speech signal processing, and computational auditory scene analysis (CASA) has recently emerged as a promising approach to solving it. This work proposes an image analysis-based algorithm that enhances the binary T–F mask obtained in the initial segmentation stage of CASA-based monaural speech separation systems in order to improve speech quality. The proposed algorithm consists of labeling the initial segmentation mask, boundary extraction, active pixel detection, and elimination of the noisy non-active pixels. In the labeling step, the T–F mask obtained from the initial segmentation is labeled into periodicity and non-periodicity pixel matrices. Next, speech boundaries are created by connecting all nearby periodicity and non-periodicity pixel matrices. Some speech boundaries may enclose noisy T–F units as holes; the proposed algorithm classifies these holes as speech-dominant or noise-dominant T–F units during the active pixel detection process. Finally, the noise-dominant T–F units are eliminated. The performance of the proposed algorithm is evaluated using the TIMIT speech database. The experimental results show that the proposed algorithm improves the quality of the separated speech, increasing the signal-to-noise ratio by an average of 9.64 dB and reducing the noise residue by 25.55% compared with the noisy speech mixture.
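The mask post-processing pipeline described above (label connected segments, extract boundaries, reclassify enclosed holes, discard noisy pixels) can be sketched with standard image-processing operations. The following is a minimal illustration, not the paper's actual implementation: it assumes `scipy.ndimage` for connected-component labeling and hole filling, and the function name and the `min_segment_size` threshold are hypothetical choices for the sketch.

```python
import numpy as np
from scipy import ndimage

def clean_tf_mask(mask, min_segment_size=8):
    """Illustrative post-processing of a binary T-F mask:
    label connected segments, fill holes enclosed by a segment
    boundary (reclassifying them as speech-dominant), and drop
    small isolated segments as noise-dominant."""
    mask = np.asarray(mask, dtype=bool)
    # 1. Label connected speech-dominant regions (8-connectivity).
    labeled, n_segments = ndimage.label(mask, structure=np.ones((3, 3), int))
    cleaned = np.zeros_like(mask)
    for k in range(1, n_segments + 1):
        segment = labeled == k
        # 2. Fill interior holes, i.e. T-F units fully surrounded
        #    by a speech boundary are reclassified as speech.
        segment = ndimage.binary_fill_holes(segment)
        # 3. Keep only segments large enough to be speech-dominant;
        #    small isolated segments are eliminated as noise.
        if segment.sum() >= min_segment_size:
            cleaned |= segment
    return cleaned

# Toy mask: one ring-shaped segment with a 2-pixel hole,
# plus a single isolated noisy pixel in the corner.
mask = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
])
cleaned = clean_tf_mask(mask)
# The hole at (2,2)-(2,3) is filled; the lone pixel at (4,7) is removed.
```

The 8-connectivity structuring element mirrors how nearby T–F units are grouped into a single speech boundary; in a real system the segment-size threshold would be replaced by the periodicity-based active pixel test.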
