Lexicon optimization based on discriminative learning for automatic speech recognition of agglutinative language

Abstract For automatic speech recognition (ASR) of agglutinative languages, selection of a lexical unit is not obvious. The morpheme unit is usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusion. We propose a discriminative approach for lexicon optimization that directly contributes to ASR error reduction by taking into account not only linguistic constraints but also acoustic–phonetic confusability. It is based on an evaluation function for each word, defined by a set of features and their weights. The weights are optimized using the difference in word error rates (WERs) between ASR hypotheses obtained by the morpheme-based model and those obtained by the word-based model. Word or sub-word entries with higher evaluation scores are then selected to be added to the lexicon. We investigate several discriminative models to realize this approach: support vector machines (SVM), a logistic regression (LR) model, and the simple perceptron algorithm. This approach was successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction of WER with a modest lexicon size and a small out-of-vocabulary rate. The use of SVM for a sub-word lexicon results in the best performance, outperforming the word-based model as well as conventional statistical concatenation approaches. The proposed learning approach is realized in an unsupervised manner because it does not require correct transcription for training data.
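
The selection scheme described in the abstract can be illustrated with a minimal sketch. It assumes a linear evaluation function (a weighted sum of per-word features) trained with the simple perceptron rule, and it assumes each training sample carries a label of +1 when the word-based hypothesis was the better of the two and -1 when the morpheme-based hypothesis was. The feature names (`bias`, `has_short_morpheme`), the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def score(weights: Dict[str, float], feats: Features) -> float:
    """Linear evaluation function: weighted sum of a candidate's features."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_perceptron(samples: List[Tuple[Features, int]],
                     epochs: int = 10,
                     lr: float = 1.0) -> Dict[str, float]:
    """Simple perceptron: update the weights only on misclassified samples.

    Each label is +1 (assumed: the word unit reduced errors) or
    -1 (assumed: the morpheme units were at least as good).
    """
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for feats, label in samples:
            if label * score(weights, feats) <= 0:
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + lr * label * value
    return weights


def select_entries(candidates: Dict[str, Features],
                   weights: Dict[str, float],
                   threshold: float = 0.0) -> List[str]:
    """Keep candidates whose evaluation score exceeds the threshold."""
    return sorted(w for w, f in candidates.items() if score(weights, f) > threshold)


# Toy data (hypothetical): words containing a short morpheme benefited
# from being kept whole, the others did not.
samples = [
    ({"bias": 1.0, "has_short_morpheme": 1.0}, +1),
    ({"bias": 1.0, "has_short_morpheme": 0.0}, -1),
]
weights = train_perceptron(samples)

candidates = {
    "word_a": {"bias": 1.0, "has_short_morpheme": 1.0},
    "word_b": {"bias": 1.0, "has_short_morpheme": 0.0},
}
selected = select_entries(candidates, weights)
```

In this sketch the labels come only from comparing two sets of recognizer outputs, which mirrors the abstract's point that no reference transcription is needed. An SVM or logistic regression model could be substituted for `train_perceptron` without changing the selection step.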
