Modelling the nonstationarity of speech in the maximum negentropy beamformer

State-of-the-art automatic speech recognition (ASR) systems can achieve very low word error rates (WERs), below 5%, on data recorded with headsets. In many situations, however, such as ASR at meetings or in the car, far-field microphones on the table, on the walls or in devices such as laptops are preferable to microphones that must be worn close to the users' mouths. Unfortunately, the distance between speakers and microphones introduces significant noise and reverberation, and as a consequence the WERs of current ASR systems on such data tend to be unacceptably high (upwards of 30-50%). A microphone array, i.e. several microphones, can alleviate the problem somewhat by performing spatial filtering: beamforming techniques combine the sensors' outputs so as to focus the processing on a particular direction. Assuming that the signal of interest arrives from a different direction than the noise, this can improve the signal quality and reduce the WER by filtering out sounds from non-relevant directions.

Historically, array processing techniques grew out of research on non-speech data, e.g. in the fields of sonar and radar, and as a consequence most techniques were not designed specifically for beamforming in the context of ASR. While this generality can be seen as an advantage in theory, it also means that these methods ignore characteristics of speech that could be exploited to benefit ASR. An example of beamforming adapted to speech processing is the recently proposed maximum negentropy beamformer (MNB), which exploits the statistical characteristics of speech as follows. “Clean” headset speech differs from noisy or reverberant speech in its statistical distribution, which is much less Gaussian in the clean case.
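This statistical contrast can be illustrated with a small simulation (a hypothetical sketch, not taken from the thesis: a Laplacian stands in for super-Gaussian clean speech, a Gaussian for heavily corrupted speech, and the moment-based negentropy approximation is a standard one rather than necessarily the thesis's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Stand-ins for the two regimes: clean speech is commonly modelled as
# super-Gaussian (here a Laplacian), while noise and reverberation push
# the observed distribution towards a Gaussian.
clean_like = rng.laplace(0.0, 1.0, n)
noisy_like = rng.normal(0.0, 1.0, n)

def negentropy_approx(y):
    """Moment-based negentropy approximation for a standardised signal:
    J(y) ~= E[y^3]^2 / 12 + kurt(y)^2 / 48, which vanishes for a Gaussian.
    (A unit-variance Laplacian has excess kurtosis 3, so J ~= 0.1875.)"""
    y = (y - y.mean()) / y.std()
    skew = np.mean(y ** 3)
    kurt = np.mean(y ** 4) - 3.0   # excess kurtosis, 0 for a Gaussian
    return skew ** 2 / 12.0 + kurt ** 2 / 48.0

print(negentropy_approx(clean_like))   # markedly positive
print(negentropy_approx(noisy_like))   # close to zero
```

The same quantity, computed on beamformer output rather than synthetic samples, is what the MNB's weight optimisation drives upward.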
Since negentropy is a measure of non-Gaussianity, choosing beamformer weights that maximise the negentropy of the output yields speech whose distribution is closer to that of clean speech, which in turn has been shown to improve WERs [Kumatani et al., 2009].

In this thesis, several refinements of the MNB algorithm are proposed and evaluated. First, a number of modifications to the original MNB configuration are proposed on theoretical or practical grounds. These concern the probability density function (pdf) used to model speech, the estimation of the pdf parameters, and the method of calculating the negentropy. Second, a further step is taken to reflect the characteristics of speech by introducing time-varying pdf parameters. The original MNB uses fixed estimates per utterance, which do not account for the nonstationarity of speech. Several time-dependent variance estimates are therefore proposed, beginning with a simple moving-average window and extending to the HMM-MNB, which derives the variance estimate from a set of auxiliary hidden Markov models.

All beamformer algorithms presented in this thesis are evaluated through far-field ASR experiments on the Multi-Channel Wall Street Journal Audio-Visual Corpus, a database of utterances captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. While the proposed methods do not improve ASR performance, a more efficient MNB algorithm is developed, and it is shown that comparable results can be achieved with significantly less data than all frames of the utterance, a result of particular relevance for real-time implementations.
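The contrast between a fixed per-utterance variance and the simplest time-dependent alternative, a moving-average window, can be sketched as follows (a hypothetical illustration, not the thesis implementation; the window length and framing are assumptions):

```python
import numpy as np

def moving_average_variance(subband, window=25):
    """Time-varying variance estimate for one subband of beamformer output:
    at each frame t, average the power |Y(t)|^2 over a centred window of
    neighbouring frames (the window shrinks at the utterance edges)."""
    power = np.abs(np.asarray(subband)) ** 2
    half = window // 2
    est = np.empty(len(power))
    for t in range(len(power)):
        lo, hi = max(0, t - half), min(len(power), t + half + 1)
        est[t] = power[lo:hi].mean()
    return est

def utterance_variance(subband):
    """Stationary estimate in the spirit of the original MNB: a single
    mean power over all frames of the utterance."""
    return np.mean(np.abs(np.asarray(subband)) ** 2)
```

A signal that switches from quiet to loud mid-utterance shows the difference: the moving-average estimate tracks the change frame by frame, whereas the per-utterance estimate smears both regimes into one constant.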

[1] Sven Nordholm et al. Filter bank design for subband adaptive microphone arrays, 2003, IEEE Trans. Speech Audio Process.

[2] H. Brehm et al. Description and generation of spherically invariant speech-model signals, 1987.

[3] B. D. Van Veen et al. Beamforming: a versatile approach to spatial filtering, 1988, IEEE ASSP Magazine.

[4] K. Kumatani et al. On hidden Markov model maximum negentropy beamforming, 2008.

[5] Janet M. Baker et al. The Design for the Wall Street Journal-based CSR Corpus, 1992, HLT.

[6] S. Boll et al. Suppression of acoustic noise in speech using spectral subtraction, 1979.

[7] Marc Moonen et al. Frequency-domain criterion for the speech distortion weighted multichannel Wiener filter for robust noise reduction, 2007, Speech Commun.

[8] Dirk Van Compernolle. Noise adaptation in a hidden Markov model speech recognition system, 1989.

[9] J. R. Cohen et al. Application of an auditory model to speech recognition, 1989, The Journal of the Acoustical Society of America.

[10] Alan V. Oppenheim et al. Discrete-time Signal Processing, Vol. 2, 2001.

[11] John A. Nelder et al. A Simplex Method for Function Minimization, 1965, Comput. J.

[12] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.

[13] R. Zelinski. A microphone array with adaptive post-filtering for noise reduction in reverberant rooms, 1988, ICASSP.

[14] R. Gallager. Information Theory and Reliable Communication, 1968.

[15] M. Basseville. Distance measures for signal processing and pattern recognition, 1989.

[16] Simon Doclo et al. Multi-microphone noise reduction and dereverberation techniques for speech applications, 2003.

[17] Walter Kellermann et al. Adaptive Beamforming for Audio Signal Acquisition, 2003.

[18] Mark J. F. Gales et al. The Application of Hidden Markov Models in Speech Recognition, 2007, Found. Trends Signal Process.

[19] Israel Cohen et al. An Integrated Real-Time Beamforming and Postfiltering System for Nonstationary Noise Environments, 2003, EURASIP J. Adv. Signal Process.

[20] Mark J. F. Gales et al. Maximum likelihood linear transformations for HMM-based speech recognition, 1998, Comput. Speech Lang.

[21] Jan Mark de Haan. Filter Bank Design for Subband Adaptive Filtering, 2001.

[22] J. Aldrich. R. A. Fisher and the making of maximum likelihood 1912-1922, 1997.

[23] Nedelko Grbic et al. Optimal and Adaptive Subband Beamforming, 2001.

[24] John W. McDonough et al. Adaptive Beamforming With a Minimum Mutual Information Criterion, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[25] Harry L. Van Trees et al. Optimum Array Processing, 2002.

[26] Carla Teixeira Lopes et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 2012.

[27] S. Nordholm et al. A spatial filtering approach to robust adaptive beaming, 1992.

[28] John McDonough et al. Distant Speech Recognition, 2009.

[29] R. Fletcher. Practical Methods of Optimization, 1988.

[30] Kiyohiro Shikano et al. Blind source separation based on a fast-convergence algorithm combining ICA and beamforming, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[31] James L. Massey et al. Proper complex random processes with applications to information theory, 1993, IEEE Trans. Inf. Theory.

[32] Satoshi Nakamura et al. Multichannel Bin-Wise Robust Frequency-Domain Adaptive Filtering and Its Application to Adaptive Beamforming, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[33] Reinhold Häb-Umbach et al. Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller, 2008, ICASSP.

[34] Steve Young et al. WSJCAM0 corpus and recording description, 1994.

[35] David Pearce et al. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, 2000, INTERSPEECH.

[36] B. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, 1974, The Journal of the Acoustical Society of America.

[37] Dietrich Klakow et al. Beamforming With a Maximum Negentropy Criterion, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[38] Mark J. F. Gales et al. Joint uncertainty decoding for noise robust speech recognition, 2005, INTERSPEECH.

[39] Walter Kellermann et al. Blind Source Separation for Convolutive Mixtures: A Unified Treatment, 2004.

[40] Thomas Niesler et al. The 1998 HTK system for transcription of conversational telephone speech, 1999, ICASSP.

[41] Xue Wang et al. Analysis of context-dependent segmental duration for automatic speech recognition, 1996, ICSLP.

[42] O. L. Frost. An algorithm for linearly constrained adaptive array processing, 1972.

[43] H. Ney et al. Linear discriminant analysis for improved large vocabulary continuous speech recognition, 1992, ICASSP.

[44] Nobuhiko Kitawaki et al. A combined approach of array processing and independent component analysis for blind separation of acoustic signals, 2001, ICASSP.

[45] Mark J. F. Gales et al. Cepstral parameter compensation for HMM recognition in noise, 1993, Speech Commun.

[46] S. Nordholm et al. Adaptive beamforming: Spatial filter designed blocking matrix, 1994.

[47] P. Vaidyanathan. Multirate Systems And Filter Banks, 1992.

[48] S. Gannot et al. Speech enhancement based on the general transfer function GSC and postfiltering, 2004, IEEE Trans. Speech Audio Process.

[49] Roger K. Moore et al. Hidden Markov model decomposition of speech and noise, 1990, ICASSP.

[50] Hynek Hermansky et al. RASTA processing of speech, 1994, IEEE Trans. Speech Audio Process.

[51] Biing-Hwang Juang et al. Fundamentals of speech recognition, 1993, Prentice Hall signal processing series.

[52] Chen Yang et al. Static and Dynamic Spectral Features: Their Noise Robustness and Optimal Weights for ASR, 2005, IEEE Transactions on Audio, Speech, and Language Processing.

[53] J. Rice. Mathematical Statistics and Data Analysis, 1988.

[54] Erkki Oja et al. Independent component analysis: algorithms and applications, 2000, Neural Networks.

[55] M. Wolfel et al. Minimum variance distortionless response spectral estimation, 2005, IEEE Signal Processing Magazine.

[56] Frederick Jelinek. Statistical methods for speech recognition, 1997.

[57] Richard M. Stern et al. Microphone array processing for robust speech recognition, 2003.

[58] Akihiko Sugiyama et al. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters, 1999, IEEE Trans. Signal Process.

[59] D. Anderson et al. Algorithms for minimization without derivatives, 1974.

[60] L. J. Griffiths et al. An alternative approach to linearly constrained adaptive beamforming, 1982.

[61] Dimitri P. Bertsekas. Nonlinear Programming, 1997.

[62] Jonathan G. Fiscus et al. The Rich Transcription 2007 Meeting Recognition Evaluation, 2007, CLEAR.

[63] James H. Martin et al. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition, 2000, Prentice Hall series in artificial intelligence.

[64] Li Deng et al. Uncertainty decoding with SPLICE for noise robust speech recognition, 2002, ICASSP.

[65] Joerg Bitzer et al. Post-Filtering Techniques, 2001, Microphone Arrays.

[66] Yannick Mahieux et al. Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering, 1998, IEEE Trans. Speech Audio Process.

[67] M. Varanasi et al. Parametric generalized Gaussian density estimation, 1989.

[68] Sadaoki Furui et al. Speaker-independent isolated word recognition using dynamic features of speech spectrum, 1986, IEEE Trans. Acoust. Speech Signal Process.

[69] G. Forney, Jr. The Viterbi algorithm, 1973.

[70] J. A. Domínguez-Molina. A practical procedure to estimate the shape parameter in the generalized Gaussian distribution, 2002.

[71] Nicholas W. D. Evans et al. An Assessment on the Fundamental Limitations of Spectral Subtraction, 2006, ICASSP.

[72] L. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, 1972.

[73] Richard M. Stern et al. A vector Taylor series approach for environment-independent speech recognition, 1996, ICASSP.

[74] Ehud Weinstein et al. Signal enhancement using beamforming and nonstationarity with applications to speech, 2001, IEEE Trans. Signal Process.

[75] Stephen Cox et al. Some statistical issues in the comparison of speech recognition algorithms, 1989, ICASSP.

[76] Dietrich Klakow et al. Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming, 2008, ICASSP.

[77] Mark J. F. Gales et al. Semi-tied covariance matrices for hidden Markov models, 1999, IEEE Trans. Speech Audio Process.

[78] John W. McDonough et al. Tracking and beamforming for multiple simultaneous speakers with probabilistic data association filters, 2006, INTERSPEECH.

[79] Sven Nordholm et al. Adaptive array noise suppression of handsfree speaker input in cars, 1993.

[80] P. Mermelstein. Distance measures for speech recognition, psychological and instrumental, 1976.

[81] Alejandro Acero et al. Acoustical and environmental robustness in automatic speech recognition, 1991.

[82] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[83] Giuseppe Ruggeri et al. Performance evaluation and comparison of ITU-T/ETSI voice activity detectors, 2001, ICASSP.

[84] Aapo Hyvärinen et al. Survey on Independent Component Analysis, 1999.

[85] Sean R. Eddy. What is dynamic programming?, 2004, Nature Biotechnology.

[86] Arun Ross et al. Microphone Arrays, 2009, Encyclopedia of Biometrics.

[87] Walter Kellermann et al. Frequency-domain integration of acoustic echo cancellation and a generalized sidelobe canceller with improved robustness, 2002, Eur. Trans. Telecommun.

[88] Reinhold Häb-Umbach et al. Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[89] Richard M. Schwartz et al. A compact model for speaker-adaptive training, 1996, ICSLP.

[90] Phil D. Green et al. Handling missing data in speech recognition, 1994, ICSLP.

[91] I. McCowan et al. The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments, 2005, IEEE Workshop on Automatic Speech Recognition and Understanding.