Incorporating knowledge on segmental duration in HMM-based continuous speech recognition

Automatic speech recognition (ASR) is the recognition of spoken messages by computers. In present-day state-of-the-art ASR there are two competing approaches: the statistical approach based on hidden Markov models (HMMs) and the rule-based knowledge-engineering approach, of which the statistical approach currently performs better. Various attempts to combine the two approaches also exist. The research presented in this thesis takes the view that both approaches use knowledge about speech, but represent that knowledge in different ways. Our way of combining the two approaches is to incorporate specific knowledge into an HMM-based statistical ASR system. This thesis therefore searches for technically feasible methods of knowledge incorporation, based both on the structure of the HMM-based recogniser and on the complicated duration regularities observed in speech data.

Chapter 1 first reviews the current state of the art in ASR, leading to the conclusion that technical improvements are still necessary and possible. In our view the history of ASR can be seen as a gradual process of incorporating specific knowledge about speech into the recognisers, so that each improvement generally amounts to the incorporation of a specific piece of knowledge. The present study concentrates on knowledge about the durational behaviour of the phonetic segments (phones), both because there is a rich body of literature on this subject and because the currently most successful HMM techniques have not incorporated this knowledge appropriately. With the HMM chosen as the basic recogniser structure, the problem of incorporating durational knowledge has two sides: on the one hand the durational behaviour of the HMM itself, and on the other hand the durational behaviour of the phonetic segments as observed in an actual speech database. The first difficulty is the linkage between these two aspects: there is no single representation of this knowledge that can be used both to collect it from the database and to incorporate it into the HMM recogniser. This makes the general paradigm of the study a methodological one: searching for appropriate representations and for feasible ways of incorporation. Other technical choices for the thesis work are also presented in this chapter, such as the use of monophone HMMs (chosen for manageable complexity and a tangible effect of duration modelling) and the (main) use of the multi-speaker TIMIT database.
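As background to the mismatch sketched above (this is standard HMM theory, not a quotation from the thesis), the duration distribution that a conventional HMM assigns to a state follows directly from its self-transition probability. The rescoring expression further below, with weight alpha and phone-duration densities p_{s_n}, is only an illustrative sketch of how an explicit duration term could be combined with the HMM score; it is not the particular method developed in the thesis.

\[
P_i(d) \;=\; a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \ldots
\]

This distribution is geometric, with its mode always at d = 1, whereas measured phone durations are typically unimodal around a context-dependent mean. One common way of bringing in durational knowledge is to add an explicit duration term when scoring a hypothesised phone sequence s_1, ..., s_N with durations d_1, ..., d_N:

\[
\log \tilde{P}(O, s_1^N) \;=\; \log P_{\mathrm{HMM}}(O, s_1^N) \;+\; \alpha \sum_{n=1}^{N} \log p_{s_n}(d_n).
\]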
