The IBM speech-to-speech translation system for smartphone: Improvements for resource-constrained tasks

This paper describes our recent improvements to the IBM TRANSTAC speech-to-speech translation systems. The improvements address two kinds of constraints in resource-constrained tasks: limited linguistic resources and training data, and limited computational power on mobile platforms such as smartphones. We show how the proposed algorithms and methodologies improve the performance of automatic speech recognition, statistical machine translation, and text-to-speech synthesis while achieving low-latency two-way speech-to-speech translation on mobile devices.
