Automatic Visual Augmentation for Concatenation-Based Synthesized Articulatory Videos from Real-Time MRI Data for Spoken Language Training

For the benefit of spoken language training, concatenation-based articulatory video synthesis has been proposed in the past to overcome the limitations of articulatory data recording. For this, real-time magnetic resonance imaging (rt-MRI) video image-frames (IFs) containing articulatory movements have been used. These IFs require visual augmentation to be easily understood. In this work, we propose an augmentation method that uses the pixel intensities in the regions enclosed by the articulatory boundaries obtained from air-tissue boundaries (ATBs). Since the pixel intensities reflect the muscle movements of the articulators, coloring the IFs according to these intensities yields augmented IFs that depict realistic articulatory movements. However, manual ATB annotation is time consuming; hence, we propose to synthesize ATBs from the ATBs of the few selected frames that were used to synthesize the articulatory videos. We augment a set of synthesized articulatory videos for 50 words obtained from the MRI-TIMIT database. A subjective evaluation of the augmented video quality with twenty-one subjects suggests that the videos are visually more appealing than the corresponding synthesized rt-MRI videos, with a rating of 3.75 out of 5, where a score of 5 (1) indicates that the augmented video quality is excellent (poor).
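
As a rough illustration of the region-coloring step, the following Python sketch (our own illustration, not the authors' implementation) fills each ATB-enclosed region with a color scaled by the mean pixel intensity inside that region and blends the result with the rt-MRI frame. The contour format, the per-region base colors, and the blending weight alpha are all illustrative assumptions.

    # A minimal sketch of intensity-based visual augmentation, assuming
    # closed ATB contours are available per articulator region.
    import cv2
    import numpy as np

    def augment_frame(frame_gray, atb_contours, base_colors, alpha=0.6):
        """Color the articulator regions of one rt-MRI image-frame.

        frame_gray   : HxW uint8 rt-MRI frame.
        atb_contours : list of (N_i, 2) integer arrays, each a closed
                       ATB contour (x, y) enclosing one articulator region.
        base_colors  : list of BGR tuples, one base color per contour
                       (hypothetical; e.g. tongue, lips, velum).
        alpha        : blend weight of the color overlay (assumed).
        """
        overlay = cv2.cvtColor(frame_gray, cv2.COLOR_GRAY2BGR)
        for contour, color in zip(atb_contours, base_colors):
            mask = np.zeros_like(frame_gray)
            cv2.fillPoly(mask, [contour.astype(np.int32)], 255)
            # Mean intensity inside the region stands in for the muscle
            # activity that the augmentation is meant to convey.
            mean_int = cv2.mean(frame_gray, mask=mask)[0] / 255.0
            shade = np.array([c * mean_int for c in color])
            region = mask.astype(bool)
            overlay[region] = (alpha * shade
                               + (1 - alpha) * overlay[region]).astype(np.uint8)
        return overlay

Under this sketch, the base color identifies the articulator while the intensity-driven shading varies frame to frame, so muscle movement shows up as a change in color strength rather than as a subtle grayscale change.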
