A survey of voice translation methodologies — Acoustic dialect decoder

Language translation has traditionally meant supplying source content as text or audio and waiting for the system to return translated output in the desired form. In this paper, we present the Acoustic Dialect Decoder (ADD), a voice-to-voice earpiece translation device. We introduce and survey recent advances in speech engineering that can be employed in the ADD, focusing on the three major processing steps: recognition, translation, and synthesis. We tackle the problem of machine understanding of natural language by designing a recognition unit that converts source audio to text, a translation unit that converts source-language text to target-language text, and a synthesis unit that converts target-language text to target-language speech. Speech from the surroundings is recorded by the recognition unit on the earpiece, and translation starts as soon as one sentence has been successfully read; in this way, we aim to produce translated output while the input is still being read. The recognition unit will use the Hidden Markov Model Toolkit (HTK), the translation unit will use recurrent neural networks (RNNs) with LSTM cells, and the synthesis unit will use the HMM-based speech synthesis system HTS. The system will initially be built as an English-to-Tamil translation device.
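The three-unit pipeline described above can be sketched in a few lines of Python. This is a minimal illustrative skeleton only: the function names, the toy word lexicon, and the frame format are assumptions for demonstration, and each stand-in function replaces a full model (HTK-based recognition, an LSTM encoder-decoder, HTS synthesis) with trivial placeholder logic.

```python
def recognize(audio_frames):
    """Stand-in for the HTK-based recognition unit: maps recorded
    audio frames to source-language (English) text. A real system
    would decode HMM states; here each frame already carries a word."""
    return " ".join(frame["word"] for frame in audio_frames)

def translate(source_text, lexicon):
    """Stand-in for the LSTM encoder-decoder translation unit:
    maps English text to Tamil text via a toy word-for-word lexicon."""
    return " ".join(lexicon.get(w, w) for w in source_text.split())

def synthesize(target_text):
    """Stand-in for the HTS synthesis unit: would generate Tamil
    speech waveforms; here it just returns a tagged string."""
    return f"<speech lang='ta'>{target_text}</speech>"

def acoustic_dialect_decoder(audio_frames, lexicon):
    """Run translation as soon as one full sentence is recognized,
    mirroring the sentence-by-sentence streaming design."""
    text = recognize(audio_frames)
    return synthesize(translate(text, lexicon))

if __name__ == "__main__":
    # Toy romanized English-to-Tamil lexicon, purely illustrative.
    lexicon = {"hello": "vanakkam", "friend": "nanban"}
    frames = [{"word": "hello"}, {"word": "friend"}]
    print(acoustic_dialect_decoder(frames, lexicon))
```

In the real device, each stage would run asynchronously so that synthesis of one sentence can overlap with recognition of the next.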
