Robust speech recognition using missing feature theory and target speech enhancement based on degenerate unmixing and estimation technique

A method for target speech enhancement based on the degenerate unmixing and estimation technique (DUET) has been described. Conventional DUET requires the number of sources to be known in advance and the attenuation and delay parameters to be estimated for all sources; to avoid these requirements, the method assumes that only one target signal needs to be extracted, which is often plausible in real-world applications such as speech enhancement. By estimating the parameters for the target source only, the method recovers the target speech efficiently with fast convergence, and it does not require prior knowledge of the number of sources. To accomplish robust speech recognition, we propose an algorithm that applies cluster-based missing-feature reconstruction to the log-spectral features of the enhanced speech during the extraction of mel-frequency cepstral coefficients (MFCCs). The algorithm identifies missing time-frequency regions by computing signal-to-noise ratios (SNRs) from the log-spectral features of the enhanced speech and the observed noisy speech, and by marking time-frequency segments whose SNRs fall below a threshold. The missing time-frequency regions are filled by bounded estimation based on the log-spectral features considered reliable and on knowledge of the log-spectral feature cluster to which the incoming target speech is assumed to belong. The log-spectral features are then transformed into cepstral features in the usual fashion of MFCC extraction. Experimental results show that the proposed algorithm significantly improves recognition performance in noisy environments.
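The SNR-based masking and bounded-estimation steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cluster knowledge is reduced to a single hypothetical mean log-spectrum, and the per-cell noise power is approximated as the excess of the noisy observation over the enhanced speech.

```python
import numpy as np

def reconstruct_missing(enhanced, noisy, cluster_mean, thresh_db=0.0):
    """Sketch of cluster-based missing-feature reconstruction.

    enhanced, noisy : (frames, bands) log power-spectral features of the
                      enhanced speech and the observed noisy speech.
    cluster_mean    : (bands,) mean log-spectrum of the cluster the target
                      speech is assumed to belong to (hypothetical model;
                      the actual method uses a richer cluster description).
    """
    # Approximate per-cell noise power as the excess of the noisy
    # observation over the enhanced speech (a simplifying assumption).
    speech_pow = np.exp(enhanced)
    noise_pow = np.maximum(np.exp(noisy) - speech_pow, 1e-12)
    snr_db = 10.0 * np.log10(speech_pow / noise_pow)

    # Cells whose SNR falls below the threshold are treated as missing.
    reliable = snr_db >= thresh_db

    # Bounded estimation: fill unreliable cells from the cluster model,
    # upper-bounded by the observed noisy log-spectrum (the observation
    # bounds the underlying clean speech energy from above).
    filled = np.where(reliable, enhanced, np.minimum(cluster_mean, noisy))
    return filled, reliable
```

The reconstructed log-spectral features `filled` would then be passed through the DCT stage of ordinary MFCC extraction.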
