Towards a Multimodal and Context-Aware Framework for Human Navigational Intent Inference

A socially acceptable robot needs to make correct decisions and understand human intent in order to interact with and navigate around humans safely. Although research in computer vision and robotics has made huge advances in recent years, today's robotic systems still need a better understanding of human intent to become more effective and widely accepted. Currently, such inference is typically done using only one mode of perception, such as vision or the human movement trajectory. In this extended abstract, I describe my PhD research plan for developing a novel multimodal and context-aware framework in which a robot infers human navigational intentions through multimodal perception comprising temporal facial, body-pose, and gaze features, human motion features, and environmental context. To facilitate this framework, a data collection experiment is designed to acquire multimodal human-robot interaction data. Our initial design of the framework is based on a temporal neural network model that takes human motion, body pose, and head orientation features as input; we will increase the complexity of the neural network model and the input features along the way. In the long term, this framework can benefit a variety of settings such as autonomous driving, service robots, and household robots.
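As a rough illustration of the kind of multimodal input such a temporal model would consume, the sketch below assembles per-frame motion, 2D body-pose, and head-orientation features into a T×D sequence that could be fed to a recurrent model such as an LSTM. The feature layout and dimensions here are illustrative assumptions, not the framework's exact design.

```python
# Hypothetical per-frame feature fusion for a temporal intent model.
# Feature names and dimensions are assumptions for illustration only.

def frame_features(motion, pose_keypoints, head_yaw):
    """Concatenate one frame's multimodal features into a flat vector.

    motion:         (dx, dy) displacement of the person between frames
    pose_keypoints: list of (x, y) 2D body joints (e.g. from a pose estimator)
    head_yaw:       head orientation angle in radians
    """
    vec = list(motion)
    for x, y in pose_keypoints:
        vec.extend((x, y))
    vec.append(head_yaw)
    return vec

def build_sequence(frames):
    """Stack per-frame vectors into a T x D sequence for a temporal model.

    All frames must yield the same feature dimension D; a real pipeline
    would also handle missing detections and normalization here.
    """
    seq = [frame_features(**f) for f in frames]
    assert len({len(v) for v in seq}) == 1, "inconsistent feature dims"
    return seq

# Example: 3 frames, 2 pose keypoints per frame -> D = 2 + 4 + 1 = 7
frames = [
    {"motion": (0.1, 0.0), "pose_keypoints": [(0.5, 0.2), (0.5, 0.8)], "head_yaw": 0.1},
    {"motion": (0.1, 0.1), "pose_keypoints": [(0.6, 0.2), (0.6, 0.8)], "head_yaw": 0.2},
    {"motion": (0.0, 0.1), "pose_keypoints": [(0.6, 0.3), (0.6, 0.9)], "head_yaw": 0.3},
]
seq = build_sequence(frames)
```

The resulting sequence of fixed-size vectors is the standard input shape for sequence models, which makes it straightforward to add further modalities (e.g. gaze or environmental context) by extending the per-frame vector.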
