Reverberation Model-Based Decoding in the Logmelspec Domain for Robust Distant-Talking Speech Recognition

The REMOS (REverberation MOdeling for Speech recognition) concept for reverberation-robust distant-talking speech recognition, introduced for melspectral features in “Distant-talking continuous speech recognition based on a novel reverberation model in the feature domain” (A. Sehr et al., in Proc. Interspeech, 2006, pp. 769-772), is extended to logarithmic melspectral (logmelspec) features in this contribution. Thus, the favorable properties of REMOS, including its high flexibility with respect to changing reverberation conditions, become available in the more competitive logmelspec domain. Based on a combined acoustic model consisting of a hidden Markov model (HMM) network and a reverberation model (RM), REMOS determines clean-speech and reverberation estimates during recognition. To this end, in each iteration of a modified Viterbi algorithm, an inner optimization maximizes the joint density of the current HMM output and the RM output subject to the constraint that their combination equals the current reverberant observation. Since the combination operation in the logmelspec domain is nonlinear, numerical methods appear necessary for solving the constrained inner optimization problem. A novel reformulation of the constraint, which allows for an efficient solution by nonlinear optimization algorithms, is derived in this paper so that a practical implementation of REMOS for logmelspec features becomes possible. An in-depth analysis of this REMOS implementation investigates the statistical properties of its reverberation estimates and thereby identifies possibilities for further improving the performance of REMOS. Connected digit recognition experiments show that the proposed REMOS version in the logmelspec domain significantly outperforms the melspec version.
While the proposed RMs with parameters estimated by straightforward training for a given room are robust to a mismatch of the speaker-microphone distance, their performance significantly decreases if they are used in a room with substantially different conditions. However, by training multi-style RMs with data from several rooms, good performance can be achieved across different rooms.
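
To make the inner optimization concrete, the following is a minimal single-channel sketch under strongly simplifying assumptions that are not taken from the paper: the HMM output s and the RM output r are each modeled by one scalar Gaussian, and their combination is assumed to be x = log(exp(s) + exp(r)). The constraint is eliminated by substituting r = x + log(1 - exp(s - x)), which is one straightforward option and not necessarily the reformulation derived in the paper. All function names and parameters are hypothetical, and a grid search with golden-section refinement stands in for a general nonlinear optimizer.

```python
import math


def log_add(a, b):
    """Return log(exp(a) + exp(b)), computed in a numerically stable way."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def reverb_from_speech(x, s):
    """Solve the assumed constraint x = log(exp(s) + exp(r)) for r (needs s < x)."""
    return x + math.log1p(-math.exp(s - x))


def log_gauss(v, mean, std):
    """Log-density of a scalar Gaussian."""
    z = (v - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def inner_optimization(x, mu_s, std_s, mu_r, std_r,
                       span=20.0, n_grid=2001, n_refine=60):
    """Maximize log p(s) + log p(r) subject to log_add(s, r) == x.

    Assumption: the search is restricted to s in (x - span, x).
    Returns (s_hat, r_hat, log_joint_at_optimum).
    """
    def objective(s):
        if s >= x:  # infeasible: the speech part cannot exceed the observation
            return -math.inf
        r = reverb_from_speech(x, s)
        return log_gauss(s, mu_s, std_s) + log_gauss(r, mu_r, std_r)

    # Coarse grid over the feasible interval; both endpoints are excluded.
    step = span / (n_grid + 1)
    grid = [x - span + step * (i + 1) for i in range(n_grid)]
    best = max(grid, key=objective)

    # Golden-section refinement around the best grid point.
    lo, hi = max(best - step, x - span), min(best + step, x)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_refine):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    s_hat = 0.5 * (lo + hi)
    return s_hat, reverb_from_speech(x, s_hat), objective(s_hat)


if __name__ == "__main__":
    s_hat, r_hat, value = inner_optimization(2.0, 1.5, 1.0, 0.5, 1.0)
    print(s_hat, r_hat, value, log_add(s_hat, r_hat))
```

In this toy setting the substitution reduces the constrained two-variable problem to an unconstrained search over s alone within the feasible range s < x. A full decoder would have to solve such a problem for every mel channel, state, and frame inside the modified Viterbi recursion, which is why an efficient formulation matters.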
