Using graphone models in automatic speech recognition

This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown words. For these experiments, graphone models are integrated into the SUMMIT speech recognition framework. First, graphones are applied to automatically generate pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words. Next, a graphone hybrid recognizer is built and tested for searching song lyrics by voice, as well as for transcribing spoken lectures in an open-vocabulary scenario. These experiments demonstrate significant improvement over traditional word-only speech recognizers. Modifications to the flat hybrid model, such as reducing the graphone set size, are also considered. Finally, a hierarchical hybrid model is built and compared with the flat hybrid model on the lecture transcription task.

Thesis Supervisor: James R. Glass
Title: Principal Research Scientist

Thesis Co-Supervisor: I. Lee Hetherington
Title: Research Scientist
