Multimodal Word Discovery and Retrieval With Spoken Descriptions and Visual Concepts

In the absence of dictionaries, translators, or grammars, it is still possible to learn some of the words of a new language by listening to spoken descriptions of images. If several images, each containing a particular visually salient object, co-occur with a particular sequence of speech sounds, we can infer that those speech sounds form a word whose definition is the visible object. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcriptions) and learns a mapping from waveform segments (or phone strings) to their associated image concepts. In this article, four multimodal word discovery systems are demonstrated: three models based on statistical machine translation (SMT) and one based on neural machine translation (NMT). The systems are trained on phone transcriptions, MFCCs, and multilingual bottleneck (MBN) features. At the phone level, the SMT models outperform the NMT model, achieving a 61.6% F1 score on the phone-level word discovery task on Flickr30k. At the audio level, we compare our models with the existing ES-KMeans algorithm for word discovery and discuss some of the challenges of multimodal spoken word discovery.
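To make the SMT-style formulation concrete, the core idea of learning a phone-to-concept mapping can be sketched as an IBM Model 1-style EM alignment, treating each utterance's phone string as the "source" and the set of image concepts as the "target". This is an illustrative sketch only, not the article's actual models; the data format and the `<null>` concept (absorbing phones with no visual referent) are assumptions.

```python
from collections import defaultdict

NULL = "<null>"  # assumed placeholder concept for phones with no visual referent

def train_ibm1(pairs, n_iters=10):
    """EM estimation of translation probabilities t(phone | concept).

    pairs: list of (phones, concepts), where phones is a list of phone
    symbols and concepts is a list of visual concept labels.
    """
    t = defaultdict(lambda: 1.0)  # uniform start; EM renormalizes per utterance
    for _ in range(n_iters):
        count = defaultdict(float)  # expected co-occurrence counts c(concept, phone)
        total = defaultdict(float)  # expected marginal counts c(concept)
        for phones, concepts in pairs:
            cands = concepts + [NULL]
            for p in phones:
                z = sum(t[(c, p)] for c in cands)  # normalizer for phone p
                for c in cands:
                    delta = t[(c, p)] / z  # posterior that p aligns to c
                    count[(c, p)] += delta
                    total[c] += delta
        # M-step: renormalize counts into probabilities
        t = defaultdict(float,
                        {(c, p): count[(c, p)] / total[c] for (c, p) in count})
    return t

def align(phones, concepts, t):
    """Align each phone to its most probable concept (or to <null>)."""
    cands = concepts + [NULL]
    return [max(cands, key=lambda c: t[(c, p)]) for p in phones]
```

On a toy corpus where the phones "d ao g" always co-occur with the concept `dog` and "k ae t" with `cat`, a few EM iterations suffice for `align` to segment a mixed utterance into contiguous phone spans per concept, which is the word discovery behavior the article evaluates at the phone level.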
