Learning word meanings and grammar for verbalization of daily life activities using multilayered multimodal latent Dirichlet allocation and Bayesian hidden Markov models

Intelligent systems need to understand and respond to human words in order to interact with humans in a natural way. Several studies have attempted to realize these abilities by investigating the symbol grounding problem. For example, we proposed multilayered multimodal latent Dirichlet allocation (mMLDA) to enable the formation of various concepts and inference using grounded concepts. We previously reported on the issue of connecting words to various hierarchical concepts and also proposed a simple preliminary algorithm for generating sentences. This paper proposes a novel method that enables a sensing system to verbalize an everyday scene it observes. The method combines mMLDA with Bayesian hidden Markov models (BHMM), and the proposed algorithm improves the word inference of our previous work. The advantage of our approach is that grammar learning based on BHMM not only boosts concept selection results but also enables our method to process functional words. The proposed verbalization algorithm produces results that are far superior to those of previous methods. Finally, we developed a system to obtain multimodal data from everyday human activities. We evaluate language learning and sentence generation as a complete process under this realistic setting. The results demonstrate the effectiveness of our method.
