Symbolic phonetic features for modeling of pronunciation variation

A significant source of variation in spontaneous speech is intra-speaker pronunciation change, often realized as small feature changes, e.g., nasalized vowels or affricated stops, rather than full phone transformations. Previous computational modeling of pronunciation variation has typically involved transformations from one phone to another, in part because most speech processing systems use phone-based units. Here, a phonetic-feature-based prediction model is presented in which each phone is represented by a vector of symbolic features, each of which can be on, off, unspecified, or unused. Feature interaction is examined using different groupings of possibly dependent features; a hierarchical grouping with conditional dependencies gave the best results. Feature-based models are shown to be more efficient than phone-based models, in the sense that they require fewer parameters to predict variation while giving smaller distance and perplexity values when predictions are compared to the hand-labeled reference. A parsimonious model is better suited to incorporating new conditioning factors, and this work investigates high-level information sources, including both text (syntax, discourse) and prosody cues. Experiments show that feature-based models benefit from prosody cues but not from text cues, and that phone-based models do not benefit from any of the high-level cues explored here.
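
To make the four-valued symbolic representation concrete, the following is a minimal sketch in Python. It is an illustration only, not the model from this work: the feature inventory (`voiced`, `nasal`, `continuant`, `high`), the helper names `make_phone` and `feature_distance`, and the choice of a simple count of mismatched features as the distance are all assumptions made for the example.

```python
from enum import Enum


class FeatureValue(Enum):
    """The four symbolic values a feature can take."""
    ON = "+"
    OFF = "-"
    UNSPECIFIED = "0"
    UNUSED = "x"


# Hypothetical feature inventory, chosen only for illustration.
FEATURES = ("voiced", "nasal", "continuant", "high")


def make_phone(**values):
    """Build a feature vector; features not given default to UNSPECIFIED."""
    unknown = set(values) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return tuple(values.get(name, FeatureValue.UNSPECIFIED) for name in FEATURES)


def feature_distance(predicted, reference):
    """Count the positions where two feature vectors differ.

    This mismatch count is an assumed stand-in for a feature-level
    distance, not the measure used in the experiments.
    """
    if len(predicted) != len(reference):
        raise ValueError("feature vectors must have the same length")
    return sum(1 for p, r in zip(predicted, reference) if p != r)


ON, OFF, UNUSED = FeatureValue.ON, FeatureValue.OFF, FeatureValue.UNUSED

# A vowel and its nasalized variant differ in exactly one feature.
ae = make_phone(voiced=ON, nasal=OFF, continuant=ON, high=OFF)
ae_nasalized = make_phone(voiced=ON, nasal=ON, continuant=ON, high=OFF)

# A voiceless stop, for which the vowel-height feature is marked unused.
t = make_phone(voiced=OFF, nasal=OFF, continuant=OFF, high=UNUSED)

print(feature_distance(ae, ae_nasalized))  # 1
print(feature_distance(ae, t))             # 3
```

Under this representation, nasalization shows up as a change in a single feature, whereas a phone-to-phone model would have to treat the plain and nasalized vowels as two unrelated symbols.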
