SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

This study investigates the possibility of using statistical machine translation to create domain-specific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis of why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian and introduces a diacritics restoration module for processing the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.
