Robust Speech Recognition in a Car Using a Microphone Array

Automatic speech recognition relies on vast amounts of training speech, mostly recorded with little or no background noise, so its performance degrades significantly in noise, which increases the mismatch between training and test environments. Speech enhancement techniques can reduce this mismatch, but at very low SNR with nonstationary noise the enhanced speech may still contain significant residual noise, either in noise-only segments or in speech segments. The former masquerades as nonexistent speech and the latter as distorted speech; both significantly degrade the performance of the automatic speech recognizer. This motivates the use of voice activity detection (VAD) algorithms to determine the regions where speech is present. To use only the reliable speech features, we must further determine whether the features in a speech region come mainly from the speech itself or from nonstationary noise masking it. For more robust speech recognition, this thesis proposes a three-hypothesis VAD consisting of H0, a noise-only region; HS, a speech-dominant speech region; and HN, a noise-dominant speech region.

Spectrum-based VAD uses knowledge of the noise spectrum to detect voice activity by exploiting the nonstationary nature of speech; this thesis proposes a method of estimating the instantaneous noise spectrum for VAD. Spectrum-based VAD, however, cannot distinguish speech from nonstationary noise, because both appear nonstationary to the VAD and therefore look like speech. A microphone array can identify the noise-corrupted speech region when the nonstationary noise comes from a location other than that of the speech source. This thesis proposes a method of distinguishing HS from HN based on the steered response power (SRP) method, which estimates the power arriving from any given location.

Phonemic restoration is a phenomenon in which human listeners report hearing missing phonemes that have been replaced by noise. Given strong nonstationary noise that occasionally masks the speech region, together with knowledge of HS and HN, this thesis proposes a phoneme restoration approach for automatic speech recognition in the hidden Markov model framework. The proposed approach has two steps: speech enhancement as a preprocessor of the noisy speech signal, followed by phoneme restoration for robust speech recognition against nonstationary noise, given knowledge of HS and HN.
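The SRP idea above, steering a delay-and-sum beamformer toward a hypothesized source location and comparing the resulting powers, can be sketched in Python/NumPy as follows. This is a minimal illustration, not the thesis's actual implementation: the per-microphone steering delays, the `noise_floor` estimate, and the `h0_margin` threshold are hypothetical placeholders that a real system would estimate or tune.

```python
import numpy as np

def srp_power(frames, mic_delays, fs):
    """Delay-and-sum steered response power for one hypothesized location.

    frames: (num_mics, frame_len) array of time-aligned microphone frames.
    mic_delays: per-mic propagation delays (seconds) for the location.
    """
    num_mics, frame_len = frames.shape
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    # Phase-align each mic to the hypothesized location, then sum coherently.
    steering = np.exp(2j * np.pi * freqs[None, :]
                      * np.asarray(mic_delays)[:, None])
    beam = (spectra * steering).sum(axis=0)
    # Power of the steered beam (Parseval-style sum over frequency bins).
    return float(np.sum(np.abs(beam) ** 2)) / frame_len

def classify_frame(frames, speech_delays, noise_delays, fs,
                   noise_floor, h0_margin=2.0):
    """Three-hypothesis decision: 'H0', 'HS', or 'HN'.

    Compares the power steered toward the speech-source location with
    the power steered toward the nonstationary-noise location.
    """
    p_speech = srp_power(frames, speech_delays, fs)
    p_noise = srp_power(frames, noise_delays, fs)
    if max(p_speech, p_noise) < h0_margin * noise_floor:
        return "H0"          # noise-only region: nothing above the floor
    return "HS" if p_speech >= p_noise else "HN"
```

Given per-microphone delays for the speech-source and noise-source locations (e.g. from a preceding talker-localization step), comparing the two steered powers separates HS from HN, while a simple energy floor flags H0.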
