Towards Visually Grounded Sub-word Speech Unit Discovery

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.
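
The abstract does not say how boundaries are read off the activation patterns, so the following is only a minimal sketch under stated assumptions: that each frame of an intermediate layer's output can be summarized by the L2 norm of its channel vector, and that candidate boundary events correspond to local peaks of that norm above a threshold. The function names, the threshold, and the toy activations are hypothetical and are not taken from the paper.

```python
import math


def frame_norms(activations):
    """Return the L2 norm of each frame's channel vector.

    `activations` is a list of frames, each a list of channel values
    (an assumed layout, not the paper's).
    """
    return [math.sqrt(sum(v * v for v in frame)) for frame in activations]


def pick_peaks(signal, threshold):
    """Return indices of local maxima of `signal` that exceed `threshold`.

    On a flat plateau, only the first frame of the plateau is reported.
    """
    peaks = []
    for t in range(1, len(signal) - 1):
        is_local_max = signal[t] > signal[t - 1] and signal[t] >= signal[t + 1]
        if is_local_max and signal[t] > threshold:
            peaks.append(t)
    return peaks


# Toy stand-in for intermediate-layer activations: 7 frames, 2 channels.
toy_activations = [
    [0.1, 0.0],
    [0.2, 0.1],
    [3.0, 4.0],
    [0.3, 0.2],
    [0.1, 0.1],
    [0.0, 6.0],
    [0.2, 0.0],
]

# Hypothetical threshold chosen by hand for the toy data.
candidate_boundaries = pick_peaks(frame_norms(toy_activations), threshold=1.0)
print(candidate_boundaries)
```

In practice the frame indices would still need to be mapped back to time using the layer's effective frame shift, which depends on the model's strides and is not given here.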
