Spectral Signal Processing for ASR

The paper begins by discussing the difficulties in obtaining repeatable results in speech recognition. Theoretical arguments are presented for and against copying human auditory properties in automatic speech recognition. The “standard” acoustic analysis for automatic speech recognition, consisting of mel-scale cepstrum coefficients and their temporal derivatives, is described. Some variations and extensions of the standard analysis (PLP, cepstrum correlation methods, LDA, and variants on log power) are then discussed. These techniques pass the test of having been found useful at multiple sites, especially with noisy speech. The extent to which auditory properties can account for the advantage found for particular techniques is considered. It is concluded that the advantages do not in fact stem from auditory properties, and that there is so far little or no evidence that the study of the human auditory system has contributed to advances in automatic speech recognition. Contributions in the future are not, however, ruled out.
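As a rough illustration of the “standard” analysis named above, the following sketch computes mel-scale cepstrum coefficients and their temporal derivatives with NumPy. It is a minimal sketch under stated assumptions, not the paper's own configuration: the mel formula, frame length, hop size, filter count, number of coefficients, and regression window are all common textbook choices assumed here, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel approximation (an assumption; other fits exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=24, n_ceps=13):
    """Mel-scale cepstrum coefficients, one row per frame (assumed settings)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    # Small floor avoids log(0) on silent frames.
    log_energy = np.log(power @ bank.T + 1e-10)
    # Unnormalized DCT-II maps log filterbank energies to cepstral coefficients.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return log_energy @ dct.T

def deltas(features, width=2):
    """Temporal derivatives by linear regression over +/- `width` frames."""
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = sum(d * (padded[width + d: width + d + n_frames]
                   - padded[width - d: width - d + n_frames])
              for d in range(1, width + 1))
    return num / (2.0 * sum(d * d for d in range(1, width + 1)))

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    tone = np.sin(2.0 * np.pi * 440.0 * t)    # synthetic 1 s test signal
    static = mfcc(tone)
    feats = np.hstack([static, deltas(static)])  # static + delta features
    print(feats.shape)
```

Stacking the static coefficients with their deltas gives the combined feature vector per frame; second-order derivatives, if wanted, could be obtained by applying `deltas` to its own output.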
