Computer Vision and Natural Language Processing

Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction to the integration of computer vision and natural language processing in multimedia and robotics applications, with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies for integrating computer vision and natural language processing models under the unifying theme of distributional semantics. We draw an analogy between distributional semantics in the two fields: image embeddings in computer vision and word embeddings in natural language processing. Finally, we present a unified view of the field and propose possible future directions.


[274]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[275]  Alap Karapurkar Modeling Human Activities , 2005 .

[276]  Changsong Liu,et al.  Probabilistic Labeling for Efficient Referential Grounding based on Collaborative Discourse , 2014, ACL.

[277]  R. J. Williams,et al.  On the use of backpropagation in associative reinforcement learning , 1988, IEEE 1988 International Conference on Neural Networks.

[278]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[279]  Katrin Erk,et al.  Representing Meaning with a Combination of Logical Form and Vectors , 2015, ArXiv.

[280]  Yiannis Aloimonos,et al.  Robots with language: Multi-label visual recognition using NLP , 2013, 2013 IEEE International Conference on Robotics and Automation.

[281]  B. Scassellati,et al.  Who is IT? Inferring role and intent from agent motion , 2007, 2007 IEEE 6th International Conference on Development and Learning.

[282]  Angel X. Chang,et al.  Semantic Parsing for Text to 3D Scene Generation , 2014, ACL 2014.

[283]  Yulia Tsvetkov,et al.  Sparse Overcomplete Word Vector Representations , 2015, ACL.

[284]  Yi Li,et al.  Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web , 2015, AAAI.

[285]  Jonathan H. Connell,et al.  A Statistical Approach for Real-time Robust Background Subtraction and Shadow Detection , 2014 .

[286]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[287]  Marcus Rohrbach,et al.  A Multi-scale Multiple Instance Video Description Network , 2015, ArXiv.

[288]  N. Cowan What are the differences between long-term, short-term, and working memory? , 2008, Progress in brain research.

[289]  Yiannis Aloimonos,et al.  Affordance detection of tool parts from geometric features , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[290]  Yi Li,et al.  Neural Self Talk: Image Understanding via Continuous Questioning and Answering , 2015, ArXiv.

[291]  Hinrich Schütze,et al.  AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes , 2015, ACL.

[292]  Bernt Schiele,et al.  A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[293]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[294]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[295]  Dieter Fox,et al.  Object Recognition in 3D Point Clouds Using Web Data and Domain Adaptation , 2010, Int. J. Robotics Res..

[296]  Thorsten Brants,et al.  One billion word benchmark for measuring progress in statistical language modeling , 2013, INTERSPEECH.

[297]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[298]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[299]  G. Aschersleben,et al.  The Theory of Event Coding (TEC): a framework for perception and action planning. , 2001, The Behavioral and brain sciences.

[300]  Yejin Choi,et al.  From Large Scale Image Categorization to Entry-Level Categories , 2013, 2013 IEEE International Conference on Computer Vision.

[301]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[302]  Michael X Cohen,et al.  Organizational Routines Are Stored as Procedural Memory: Evidence from a Laboratory Study , 1994 .

[303]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[304]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[305]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[306]  Joan Bruna,et al.  Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation , 2014, NIPS.

[307]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[308]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[309]  Yiannis Aloimonos,et al.  Corpus-Guided Sentence Generation of Natural Images , 2011, EMNLP.

[310]  Matthew R. Walter,et al.  Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation , 2011, AAAI.

[311]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[312]  Deriving Boolean structures from distributional vectors , 2015, Transactions of the Association for Computational Linguistics.

[313]  Alexander Koller,et al.  Automated Planning for Situated Natural Language Generation , 2010, ACL.

[314]  Thomas Hofmann,et al.  Predicting Structured Data (Neural Information Processing) , 2007 .

[315]  Frank Keller,et al.  Comparing Automatic Evaluation Measures for Image Description , 2014, ACL.

[316]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[317]  Sabine Schulte im Walde,et al.  A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities , 2013, EMNLP.

[318]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[319]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[320]  Silvia Coradeschi,et al.  A Short Review of Symbol Grounding in Robotic and Intelligent Systems , 2013, KI - Künstliche Intelligenz.

[321]  Eren Erdal Aksoy,et al.  Learning the Semantics of Manipulation Action , 2015, ACL.

[322]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[323]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[324]  Douglas Summers-Stay,et al.  Productive Vision: Methods for Automatic Image Comprehension , 2013 .

[325]  Anima Anandkumar,et al.  A Spectral Algorithm for Latent Dirichlet Allocation , 2012, Algorithmica.

[326]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[327]  Wei Lin,et al.  Revisiting Word Embedding for Contrasting Meaning , 2015, ACL.

[328]  Frank Keller,et al.  Image Description using Visual Dependency Representations , 2013, EMNLP.

[329]  Jitendra Malik,et al.  Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation , 2015, International Journal of Computer Vision.

[330]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[331]  Yiannis Aloimonos,et al.  A Language for Human Action , 2007, Computer.

[332]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[333]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[334]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.