A syllable, articulatory-feature, and stress-accent model of speech recognition

Current-generation automatic speech recognition (ASR) systems assume that words are readily decomposable into constituent phonetic components (“phonemes”). A detailed linguistic dissection of state-of-the-art speech recognition systems indicates that the conventional phonemic “beads-on-a-string” approach is of limited utility, particularly with respect to informal, conversational material. The study shows that there is a significant gap between the observed data and the pronunciation models of current ASR systems. It also shows that many important factors affecting recognition performance are not modeled explicitly in these systems. Motivated by these findings, this dissertation analyzes spontaneous speech with respect to three important, but often neglected, components of speech (at least with respect to English ASR): articulatory-acoustic features (AFs), the syllable, and stress accent. The results of this analysis provide evidence for an alternative approach to speech modeling, one in which the syllable assumes pre-eminent status and is melded to the lower as well as the higher tiers of linguistic representation through the incorporation of prosodic information such as stress accent. Using concrete examples and statistics from spontaneous speech material, it is shown that there exists a systematic relationship between the realization of AFs and stress accent in conjunction with syllable position. This relationship can be used to provide an accurate and parsimonious characterization of pronunciation variation in spontaneous speech. An approach to automatically extracting AFs from the acoustic signal is also developed, as is a system for the automatic stress-accent labeling of spontaneous speech. Based on the results of these studies, a syllable-centric, multi-tier model of speech recognition is proposed.
The model explicitly relates AFs, phonetic segments and syllable constituents to a framework for lexical representation, and incorporates stress-accent information into recognition. A test-bed implementation of the model is developed using a fuzzy-based approach for combining evidence from various AF sources and a pronunciation-variation modeling technique using AF-variation statistics extracted from data. Experiments on a limited-vocabulary speech recognition task using both automatically derived and fabricated data demonstrate the advantage of incorporating AF and stress-accent modeling within the syllable-centric, multi-tier framework, particularly with respect to pronunciation variation in spontaneous speech.
