Incorporating Knowledge on Segmental Duration in HMM-based Continuous Speech Recognition

This chapter gives a short description of the state of the art in automatic speech recognition (ASR). ASR is presented as a technique into which knowledge about speech has been incorporated only gradually. The currently most successful statistical approach to ASR still leaves room for improvement through a more appropriate incorporation of such knowledge. In this thesis, knowledge about segmental duration is chosen, because this type of knowledge is particularly lacking in current ASR systems. The scope and methodology of the present study are discussed, and the speech recognisers and databases used in the study are described. Finally, an outline of the thesis is given, together with an account of how the thesis project developed.
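To make the central idea concrete, the sketch below shows one common way in which explicit segmental-duration knowledge can be combined with an HMM-based recogniser: rescoring N-best hypotheses with a per-phone duration model, since a standard HMM implicitly assumes a geometric state-duration distribution. This is only an illustration under assumptions; the gamma distribution, the phone set, the parameter values, and all function names are hypothetical and are not taken from the thesis itself.

```python
# Minimal sketch: rescore N-best hypotheses with an explicit phone-duration model.
# All parameters and names here are illustrative assumptions, not the thesis's method.

import math

# Hypothetical per-phone gamma duration parameters (shape k, scale theta),
# with durations measured in frames; in practice these would be estimated
# from a segmented corpus such as TIMIT.
DURATION_MODELS = {
    "s":  (5.0, 2.5),
    "aa": (6.0, 2.0),
    "t":  (3.0, 1.5),
}

def gamma_log_pdf(d, k, theta):
    """Log density of a gamma distribution evaluated at duration d (in frames)."""
    return (k - 1) * math.log(d) - d / theta - math.lgamma(k) - k * math.log(theta)

def duration_score(segments):
    """Sum of duration log-likelihoods over a hypothesised phone segmentation."""
    total = 0.0
    for phone, dur in segments:
        k, theta = DURATION_MODELS[phone]
        total += gamma_log_pdf(dur, k, theta)
    return total

def rescore(hypotheses, weight=1.0):
    """Re-rank hypotheses by acoustic score plus a weighted duration score."""
    return sorted(
        hypotheses,
        key=lambda h: h["acoustic_score"] + weight * duration_score(h["segments"]),
        reverse=True,
    )

if __name__ == "__main__":
    nbest = [
        {"text": "sat (plausible durations)",  "acoustic_score": -120.0,
         "segments": [("s", 12), ("aa", 14), ("t", 6)]},
        {"text": "sat (implausible durations)", "acoustic_score": -119.0,
         "segments": [("s", 2), ("aa", 40), ("t", 1)]},
    ]
    print(rescore(nbest, weight=2.0)[0]["text"])
```

In this toy example the second hypothesis has a slightly better acoustic score, but its phone durations are far from the assumed duration models, so the duration term reverses the ranking; this is the kind of benefit that explicit duration knowledge is expected to provide.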
