Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence

Most speech enhancement algorithms make use of the short-time Fourier transform (STFT), a simple and flexible time-frequency decomposition that estimates the short-time spectrum of a signal. However, the duration of STFT analysis frames is inherently limited by the nonstationarity of speech. The main contribution of this paper is a demonstration of speech enhancement and automatic speech recognition in the presence of reverberation and noise using extended analysis windows. We accomplish this extension by performing enhancement in the short-time fan-chirp transform (STFChT) domain, an overcomplete time-frequency representation that remains coherent with speech signals over longer analysis windows than the STFT. This extended coherence is obtained by modeling the fundamental frequency of voiced speech as varying linearly within each frame. Our approach centers on a single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator proposed by Habets, which scales coefficients in a time-frequency domain to suppress noise and reverberation. When multiple microphones are available, we preprocess the data with either a minimum variance distortionless response (MVDR) beamformer or a delay-and-sum beamformer (DSB). We evaluate our algorithm on both speech enhancement and recognition tasks using the REVERB challenge dataset. Compared to the same processing performed in the STFT domain, our approach achieves significant improvement on objective enhancement metrics (including PESQ, the ITU-T standard measure of speech quality). In terms of automatic speech recognition (ASR) performance as measured by word error rate (WER), however, our experiments indicate that the STFT with a long window is more effective.
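To make the core idea concrete, the following is a minimal NumPy sketch of fan-chirp-style analysis of a single windowed frame, not the paper's implementation: the time axis is warped according to an assumed linear fundamental-frequency slope (the chirp rate `alpha` is a free parameter here), the frame is resampled uniformly in warped time, and an ordinary FFT is taken. A harmonic whose frequency varies linearly in time becomes stationary on the warped axis, which is the source of the extended coherence described above. The function name and the specific warp `phi(t) = (1 + alpha*t/2) * t` are illustrative assumptions, with `alpha = 0` reducing to the ordinary DFT.

```python
import numpy as np

def fan_chirp_frame(x, fs, alpha, n_fft=None):
    """Fan-chirp analysis of one windowed frame (illustrative sketch).

    x     : windowed time-domain frame (1-D array)
    fs    : sample rate in Hz
    alpha : normalized chirp rate (relative f0 slope, in 1/s);
            alpha = 0 reduces this to the ordinary DFT of the frame
    """
    n = len(x)
    if n_fft is None:
        n_fft = n
    t = np.arange(n) / fs
    # Linear-chirp time warp phi(t) = (1 + 0.5*alpha*t) * t: a sinusoid whose
    # instantaneous frequency is f0*(1 + alpha*t) becomes a pure tone in
    # warped time, so a long window stays coherent with it.
    phi = (1.0 + 0.5 * alpha * t) * t
    # Resample the frame uniformly in warped time (linear interpolation).
    phi_uniform = np.linspace(phi[0], phi[-1], n)
    x_warped = np.interp(phi_uniform, phi, x)
    return np.fft.rfft(x_warped, n_fft)
```

For a synthetic linear chirp, analyzing with a matched `alpha` concentrates the spectral peak more sharply than the `alpha = 0` (plain STFT) case, which is the effect the enhancement gains in the STFChT domain exploit. A practical system would additionally estimate `alpha` per frame from a pitch track, which this sketch omits.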
