Subword lexical modelling for speech recognition

In this thesis, we introduce and develop a novel framework, ANGIE, for modelling subword lexical phenomena in speech recognition. Our framework provides a flexible and powerful mechanism for capturing morphology, syllabification, phonology, and other subword effects in a hierarchical manner that maximizes sharing of subword structures. ANGIE models subword structure with a context-free grammar and an accompanying probability model. We believe that our framework has several advantages. The sharing mechanism allows training data to be pooled among instances of the same subword structure, even when they occur in different words in the lexicon. Further, knowledge of this substructure can be extended to filler models in a word-spotter, to new words added incrementally to a recognizer's vocabulary, and potentially to new-word detection. The context-free foundation allows for ease of research and experimentation with varying subword representations, and also facilitates integration with a natural language understanding system. Finally, the availability of subword structural information in a recognition system enables exploration of prosodic models that use this information. We demonstrate ANGIE's feasibility and efficacy in a variety of applications. Using ATIS corpus data, we show that ANGIE improves phonetic recognition, reducing the error rate from 39.8% for a phone bigram baseline to 36.1%. We show its competitiveness in the task of word-spotting, where we also report on a comparative study of different subword lexical models for the filler space. The figure-of-merit (FOM) results ranged from 85.3 for a phone bigram to 89.3 for a system using the full ANGIE parse tree and a lexicon of 1200 words.
We also discuss an implementation of a competitive continuous speech recognition system based on ANGIE, which achieves a recognition error rate of 18.8% on our test set as compared to a baseline error rate of 18.9%, both using a word bigram. Finally, we explore the integration of ANGIE with a natural language understanding system, resulting in a fully coupled system based on context-free frameworks for both phonological and linguistic modelling. The integrated system achieves a recognition error rate of 14.8% on the same test set, an improvement of 21.6%. We also discuss two pilot studies, one on handling dynamic vocabulary updates within a continuous speech recognizer and the other on hierarchical duration modelling within a word-spotter. Both studies showed promising results. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
