CNN-Based Phoneme Classifier from Vocal Tract MRI Learns Embedding Consistent with Articulatory Topology

Recent advances in real-time magnetic resonance imaging (rtMRI) of the vocal tract provide new opportunities for studying human speech. Combined with simultaneously acquired speech audio, this modality may enable the mapping of articulatory configurations to acoustic features. In this study, we take a first step by training a deep learning model to classify 27 different phonemes from midsagittal MR images of the vocal tract. An American English database from 17 subjects was used to train a convolutional neural network to classify vowels (13 classes), consonants (14 classes), and all phonemes (27 classes). Top-1 classification accuracy on the test set for all phonemes was 57%. Error analysis showed that voiced and unvoiced sounds were often confused. Moreover, we performed principal component analysis on the network's embedding and observed topological similarities between the learned representation and the vowel diagram. Saliency maps gave insight into the anatomical regions most important for classification and showed congruence with known regions of articulatory importance. We demonstrate the feasibility of using deep learning to distinguish phonemes from MRI. Network analysis can be used to improve understanding of normal articulation and speech and, in the future, impaired speech. This study brings us a step closer to articulatory-to-acoustic mapping from rtMRI.
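
To make the pipeline described above concrete, the sketch below shows a minimal CNN phoneme classifier with a penultimate embedding (the kind of representation one could feed to PCA) and a gradient-based saliency map for a single MR frame. The abstract does not specify the architecture, input resolution, or framework; the 84x84 single-channel input, the layer sizes, and the use of PyTorch are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 27  # all phonemes, as in the study

class PhonemeCNN(nn.Module):
    """Toy CNN for midsagittal vocal-tract frames (architecture is assumed, not from the paper)."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.flatten = nn.Flatten()            # 64-d embedding vector
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        z = self.flatten(self.features(x))     # embedding (could be analyzed with PCA)
        return self.classifier(z), z


model = PhonemeCNN()
frame = torch.randn(1, 1, 84, 84, requires_grad=True)  # one hypothetical MR frame

logits, embedding = model(frame)
pred = logits.argmax(dim=1).item()

# Gradient-based saliency: |d(logit of predicted class) / d(pixel)| highlights
# the image regions most influential for the prediction.
logits[0, pred].backward()
saliency = frame.grad.abs().squeeze()

# For the embedding analysis, vectors like `embedding` collected over a dataset
# could be projected to 2D with, e.g., sklearn.decomposition.PCA(n_components=2).
```

In this sketch the saliency map is a plain input-gradient map; the paper's exact saliency method is not stated in the abstract, so other attribution techniques could equally apply.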