Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation

In spoken communication, a speaker may convey their message in words (linguistic cues) along with supplemental information (paralinguistic cues) such as emotion and emphasis. Transforming all spoken information into a written or verbal form is not trivial, especially when the transformation must be done across languages. Most existing speech-to-text translation systems focus only on translating linguistic information and ignore paralinguistic information. A few recent studies that proposed paralinguistic translation used machine translation combined with hidden Markov model (HMM)-based automatic speech recognition (ASR) and text-to-speech (TTS) synthesis, a pipeline that was complicated and suboptimal; furthermore, the paralinguistic information was kept in acoustic form. Here, we focused on transcribing paralinguistic acoustic cues of emphasis into target-language text. Specifically, we constructed cascade and direct neural Transformer-based speech-to-text translation systems, and we investigated various methods of expressing emphasis information in the written form of the target language. We performed our experiments in a Japanese-to-English linguistic and paralinguistic speech-to-text translation framework. The results revealed that our proposed method can translate both linguistic and paralinguistic information while maintaining performance comparable to that of standard linguistic translation.
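To make the idea of expressing emphasis in target-language text concrete, the sketch below shows one simple tag-based encoding: words whose acoustic emphasis weight exceeds a threshold are wrapped in emphasis tokens, yielding a plain-text target sequence that a text-generating translation model can emit. The tag names, threshold, and data format here are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch of a tag-based encoding of word-level emphasis in
# target-language text. The <emph> token and the 0.5 threshold are
# hypothetical choices for illustration.

def tag_emphasis(words, emphasis_weights, threshold=0.5):
    """Wrap words whose emphasis weight exceeds `threshold` in emphasis
    tags, producing a plain-text sentence that carries paralinguistic
    information alongside the linguistic content."""
    tagged = []
    for word, weight in zip(words, emphasis_weights):
        if weight >= threshold:
            tagged.append(f"<emph> {word} </emph>")
        else:
            tagged.append(word)
    return " ".join(tagged)

# Example: strong acoustic emphasis on "really".
print(tag_emphasis(["i", "really", "like", "it"], [0.1, 0.9, 0.2, 0.0]))
# -> i <emph> really </emph> like it
```

Because the emphasis markers become ordinary vocabulary tokens, both a cascade system (ASR followed by text translation) and a direct speech-to-text model can be trained on such targets without architectural changes.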
