Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages

We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity is obtained. Standard word LMs suffer from high out-of-vocabulary (OOV) rates, whereas the morph LMs can recognize previously unseen word forms by concatenating morphs. We show that the morph LMs generally outperform the word LMs and that they perform fairly well on OOVs without compromising the accuracy obtained for in-vocabulary words.

[1]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[2]  Philip C. Woodland,et al.  Particle-based language modelling , 2000, INTERSPEECH.

[3]  André Berton,et al.  Compound words in large-vocabulary German speech recognition systems , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[4]  李幼升,et al.  Ph , 1989 .

[5]  William J. Byrne,et al.  On large vocabulary continuous speech recognition of highly inflectional language - czech , 2001, INTERSPEECH.

[6]  Ebru Arisoy,et al.  Unsupervised segmentation of words into morphemes - morpho challenge 2005 application to automatic speech recognition , 2006, INTERSPEECH.

[7]  Mathias Creutz,et al.  Induction of the morphology of natural language : unsupervised morpheme segmentation with application to automatic speech recognition , 2006 .

[8]  James R. Glass,et al.  Modeling out-of-vocabulary words for robust speech recognition , 2000, INTERSPEECH.

[9]  Michael R. Brent,et al.  An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery , 1999, Machine Learning.

[10]  Petra Geutner,et al.  Adaptive vocabularies for transcribing multilingual broadcast news , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[11]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.

[12]  Jorma Rissanen,et al.  Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[13]  Andreas Stolcke,et al.  Morphology-based language modeling for arabic speech recognition , 2004, INTERSPEECH.

[14]  Martha Larson,et al.  Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches , 2000, INTERSPEECH.

[15]  Vesa Siivola,et al.  Growing an n-gram language model , 2005, INTERSPEECH.

[16]  Izhak Shafran,et al.  Corrective Models for Speech Recognition of Inflected Languages , 2006, EMNLP.

[17]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[18]  Carl de Marcken,et al.  Unsupervised language acquisition , 1996, ArXiv.

[19]  Oh-Wook Kwon,et al.  Korean large vocabulary continuous speech recognition with morpheme-based recognition units , 2003, Speech Commun..

[20]  Hermann Ney,et al.  Open vocabulary speech recognition with flat hybrid models , 2005, INTERSPEECH.

[21]  Andreas Stolcke,et al.  Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[22]  Mathias Creutz,et al.  Unsupervised Discovery of Morphemes , 2002, SIGMORPHON.

[23]  Lucian Galescu Recognition of out-of-vocabulary words with sub-lexical language models , 2003, INTERSPEECH.

[24]  Ebru Arisoy,et al.  Unlimited vocabulary speech recognition for agglutinative languages , 2006, NAACL.

[25]  Thomas L. Griffiths,et al.  Contextual Dependencies in Unsupervised Word Segmentation , 2006, ACL.

[26]  Mikko Kurimo,et al.  Unlimited vocabulary speech recognition with morph language models applied to Finnish , 2006, Comput. Speech Lang..

[27]  Franciska de Jong,et al.  Compound decomposition in dutch large vocabulary speech recognition , 2003, INTERSPEECH.

[28]  James Glass,et al.  Modelling out-of-vocabulary words for robust speech recognition , 2002 .