Signal modeling techniques in speech recognition

A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing the techniques most commonly used. The four basic operations of signal modeling, i.e., spectral shaping, spectral analysis, parametric transformation, and statistical modeling, are discussed. Three important trends that have developed in speech recognition in the last five years are examined. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in a computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process, so that more sophisticated statistical models of the signal's spectrum can be estimated in a closed-loop manner. The signal processing components of these algorithms are reviewed.
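As an illustration of the first trend, the sketch below builds a heterogeneous parameter set by appending a time-derivative estimate to each frame of absolute spectral parameters. It is a minimal sketch under stated assumptions, not the method of any particular system reviewed here: the linear-regression derivative estimate, the function names `delta` and `append_deltas`, the `half_window` parameter, and the edge-padding of the first and last frames are all choices made for this example.

```python
import numpy as np


def delta(features, half_window=2):
    """Estimate the time derivative of each parameter by linear regression.

    features: array of shape (num_frames, num_coeffs), one row per frame.
    half_window: number of frames on each side used in the regression
                 (an assumed value chosen for illustration).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    # Repeat the first and last frames so every frame has a full window.
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    out = np.zeros_like(features)
    for k in range(1, half_window + 1):
        ahead = padded[half_window + k : half_window + k + num_frames]
        behind = padded[half_window - k : half_window - k + num_frames]
        out += k * (ahead - behind)
    return out / denom


def append_deltas(features, half_window=2):
    """Concatenate absolute parameters with their derivative estimates."""
    return np.hstack([features, delta(features, half_window)])


if __name__ == "__main__":
    # Two parameters that rise linearly over ten frames (slopes 1 and 2).
    ramp = np.arange(10)[:, None] * np.array([1.0, 2.0])
    print(append_deltas(ramp).shape)
```

For a parameter that changes linearly over time, the regression recovers the slope exactly at frames far enough from the edges; near the edges the padded frames bias the estimate toward zero.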
