Using deep learning to bridge the gap between perception and intelligence

We propose a method for robot planning that uses deep learning to integrate object detection and natural language understanding. This differs from techniques such as the RCTA World Model [6, 11], which explicitly define the interfaces between modules. These fixed boundaries simplify the design and testing of complex robotic systems, but they also introduce constraints that can reduce overall system performance. For example, perception tasks generate large amounts of data, much of which is discarded to simplify interpretation by higher-level tasks, e.g., a 3D object becomes a point in space, or a distribution over classifications is reduced to its mode. Further, errors in the robot's overall task are not back-propagated to lower-level tasks, so those tasks never adapt themselves to improve robot performance. We intend to address this by using a deep learning framework that replaces these hand-specified interfaces with learned ones, which select what data is shared between modules and allow error back-propagation to adapt each module to the robot's task. We will do this in a simplified system that accepts aerial orthographic images and simple commands and generates paths that achieve the command. Paths are learned from expert examples via inverse optimal control [20]. In time, we hope to evolve this simplified architecture toward something more complex and practical.
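
To make the proposed architecture concrete, here is a minimal sketch of how the pieces could fit together, assuming PyTorch: a small network maps an aerial image and a command token to a costmap, a grid planner extracts the cheapest path, and a max-margin-style update in the spirit of [19] (a simpler stand-in for the MaxEnt deep inverse reinforcement learning of [1]) adjusts the costmap from expert demonstrations. All names (CommandCostmapNet, dijkstra_path, irl_step), layer sizes, and the single-token command encoding are illustrative assumptions, not our actual design.

```python
# A minimal sketch, assuming PyTorch. All names and sizes are hypothetical.
import heapq

import torch
import torch.nn as nn
import torch.nn.functional as F


class CommandCostmapNet(nn.Module):
    """Map an aerial image plus a command token to a per-cell traversal cost.

    The command embedding is tiled over the image so that the cost of every
    cell can depend on what the robot was told to do.
    """

    def __init__(self, vocab_size=32, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv1 = nn.Conv2d(3 + embed_dim, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, image, command):
        # image: (B, 3, H, W); command: (B,) integer token ids
        b, _, h, w = image.shape
        cmd = self.embed(command).view(b, -1, 1, 1).expand(-1, -1, h, w)
        x = F.relu(self.conv1(torch.cat([image, cmd], dim=1)))
        x = F.relu(self.conv2(x))
        return F.softplus(self.head(x)).squeeze(1)  # positive costs, (B, H, W)


def dijkstra_path(cost, start, goal):
    """Cheapest 4-connected path over an (H, W) numpy cost grid."""
    h, w = cost.shape
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        r, c = u
        for v in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= v[0] < h and 0 <= v[1] < w:
                nd = d + float(cost[v])
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]


def irl_step(net, optimizer, image, command, expert_path, start, goal):
    """One max-margin-style IRL update: lower the cost along the expert's
    demonstrated path, raise it along the planner's current cheapest path.
    A margin term and a regularizer would be added in practice."""
    costmap = net(image, command)[0]                      # (H, W)
    planned = dijkstra_path(costmap.detach().numpy(), start, goal)
    expert_cost = sum(costmap[r, c] for r, c in expert_path)
    planned_cost = sum(costmap[r, c] for r, c in planned)
    loss = expert_cost - planned_cost
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a random 32x32 aerial tile, a made-up command token, and a
# hypothetical expert demonstration along the top row of the grid.
net = CommandCostmapNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
image = torch.rand(1, 3, 32, 32)
command = torch.tensor([3])
expert = [(0, i) for i in range(32)]
irl_step(net, opt, image, command, expert, start=(0, 0), goal=(0, 31))
```

Note that [1] replaces the single shortest path above with expected state-visitation frequencies under the full maximum-entropy distribution, and a practical command encoder would be a learned language model (e.g., in the spirit of [3]) rather than a one-token embedding.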

References

[1] Markus Wulfmeier et al. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv:1507.04888, 2015.

[2] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.

[3] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.

[5] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.

[6] Robert Dean. Common world model for unmanned systems. Defense, Security, and Sensing, 2013.

[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

[8] Daniel Munoz et al. Inference Machines: Parsing Scenes via Iterated Predictions. 2013.

[9] Nathan Ratliff et al. Learning to search: structured prediction techniques for imitation learning. 2009.

[10] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.

[11] Jean Oh et al. Common world model for unmanned systems: Phase 2. Defense + Security Symposium, 2014.

[12] Bharath Hariharan et al. Hypercolumns for object segmentation and fine-grained localization. CVPR, 2015.

[13] Felix Duvallet et al. Natural Language Direction Following for Robots in Unstructured Unknown Environments. 2015.

[14] Timo Ojala et al. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. ICPR, 1994.

[15] Martial Hebert et al. Integrated Intelligence for Human-Robot Teams. ISER, 2016.

[16] Brian D. Ziebart et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. 2010.

[17] Jean Oh et al. RCTA capstone assessment. Defense + Security Symposium, 2015.

[18] Aayush Bansal et al. PixelNet: Towards a General Pixel-level Architecture. arXiv, 2016.

[19] Nathan D. Ratliff et al. Maximum margin planning. ICML, 2006.

[20] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. ICML, 2004.

[21] Ross Girshick et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR, 2014.