Discriminative Training for Speech Recognition

Acknowledgements

First and foremost, I would like to thank Professor Katsuhiko Shirai, my thesis advisor, for encouraging me to submit this dissertation. I am deeply grateful to him for giving me this opportunity. I am also thankful for the advice and guidance he gave me during the last several years, in spite of his busy schedule and the distance between Waseda University in Tokyo and ATR Laboratories in Kyoto.

I would like to express my sincere gratitude to Professor Tetsunori Kobayashi for taking the time to give me valuable feedback throughout the submission process, for helping me with bureaucratic issues that were difficult for me to understand, let alone handle, for helping me organize the thesis draft and refine its contents, and for his patience in answering my repeated, and often repetitive, inquiries. I would also like to thank Professor Yasuo Matsuyama for his penetrating questions and comments, and the lively discussions we had. I also thank Professor Seinosuke Narita for his insightful and enthusiastic feedback. This interaction, with all four committee members, significantly helped me clarify the ideas in this dissertation.

This thesis is based entirely on work done at ATR, first in the ATR Auditory and Visual Perception Research Laboratories, then in the ATR Human Information Processing Research Laboratories. Because it is such a dynamic place, with people coming and going all the time, staying at ATR for now more than 9 years allowed me to meet a large number of extremely interesting people, of many different backgrounds and nationalities. ATR is a unique environment. I consider myself very fortunate to have experienced it.

This dissertation would not have been possible without the support and encouragement of Dr. Yoh'ichi Tohkura, President of ATR Human Information Processing Research Laboratories. I am profoundly grateful to him for his help and advice over the years, and for his relaxed, yet fatherly, style of management. I am equally indebted to Dr. Shigeru Katagiri. This dissertation charts part of the course of our long and fruitful collaboration. I deeply value what he has taught me about statistical pattern recognition, research methodology, paper writing, and peace of mind. I would also like to thank Professor Eiji Yodogawa, of Kogakuin University, for his support and enthusiasm when he was President of ATR Auditory and Visual Perception Research Laboratories, and Dr. Kohei Habara, for his encouragement and helpful advice when he was Executive Vice-President of …
