Prosodic and accentual information for automatic speech recognition

Many aspects of human speech production and perception have gradually been incorporated into automatic speech recognition systems. Nevertheless, prosodic features have not yet been used explicitly in the recognition process itself. This study analyzes the three most important prosodic parameters, namely energy, fundamental frequency and duration, and presents a method for incorporating this information into automatic speech recognition. On the basis of a preliminary analysis, a design is proposed for a prosodic feature classifier in which these parameters are associated with orthographic accentuation. The theoretical formulation for incorporating prosodic-accentual features into a hidden Markov model recognizer is presented, followed by the experimental setup. Several experiments were conducted to show how the method performs on a Spanish continuous-speech database. Applying this approach to other subsets of the database yielded a word recognition error reduction of 28.91%.
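
As a rough illustration of two of the prosodic parameters named above, the sketch below computes the log energy and a fundamental frequency estimate for a single frame of speech samples. It is not the extraction procedure used in the study, which the abstract does not specify: the autocorrelation peak-picking method, the function names `frame_energy` and `frame_f0`, and the 60-400 Hz search range are all assumptions chosen for illustration. Duration is omitted because it depends on a segmentation of the signal rather than on a single frame.

```python
import math


def frame_energy(frame):
    """Natural-log energy of one frame, floored to avoid log(0)."""
    return math.log(max(sum(x * x for x in frame), 1e-12))


def frame_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from the largest autocorrelation peak.

    Assumption: a plain (unnormalized) autocorrelation search over the
    lags corresponding to [f0_min, f0_max]. Returns 0.0 when no positive
    peak is found, which can be read as "unvoiced".
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic test signal (hypothetical): a 200 Hz sine sampled at 8 kHz, 50 ms long.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]

print(frame_energy(frame))
print(frame_f0(frame, SAMPLE_RATE))
```

In a recognizer, values like these would be computed frame by frame and combined with segment durations before being passed to a prosodic classifier. Real speech would also need voicing detection and smoothing, which this sketch leaves out.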
