Speech production knowledge in automatic speech recognition.

Although much is known about how speech is produced, and research into speech production has yielded measured articulatory data, feature systems of various kinds, and numerous models, this knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech that cannot be easily analyzed from either the acoustic signal or the phonetic transcription alone. This article surveys a growing body of work in which such representations are used to improve automatic speech recognition.
