Hybrid language models for speech transcription

This paper analyzes the use of hybrid language models for automatic speech transcription. The intended application is a communication aid for deaf people, running on an embedded decoder in a portable device, which constrains the model size. The main linguistic units considered for this task are words and syllables. Various lexicon sizes are studied by setting thresholds on word occurrence frequencies in the training data; words below the threshold are decomposed into syllables. A recognizer using this kind of language model outputs between 62% and 96% of words, depending on the frequency threshold (the remaining recognized lexical units are syllables). By setting different thresholds on the confidence measures associated with the recognized words, the most reliable word hypotheses can be identified; these achieve correct recognition rates between 70% and 92%.
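The hybrid lexicon construction described above (keeping frequent words as-is and syllabifying the rest) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `syllabify` helper is a hypothetical stand-in for a real French syllabification tool, and `min_count` plays the role of the word-frequency threshold.

```python
from collections import Counter

def syllabify(word):
    # Hypothetical stand-in for a real syllabifier: naively
    # splits the word into 3-character chunks for illustration.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def hybridize(corpus, min_count):
    """Replace words occurring fewer than min_count times with their
    syllables, yielding the mixed word/syllable token stream on which
    a hybrid language model would be trained."""
    counts = Counter(w for sentence in corpus for w in sentence)
    hybrid = []
    for sentence in corpus:
        tokens = []
        for w in sentence:
            if counts[w] >= min_count:
                tokens.append(w)        # frequent word: kept whole
            else:
                tokens.extend(syllabify(w))  # rare word: syllabified
        hybrid.append(tokens)
    return hybrid
```

Raising `min_count` shrinks the word lexicon and shifts more of the output toward syllable units, which is the trade-off the paper quantifies (62% to 96% of recognized units being words).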
