Anticipating the Start of User Interaction for Service Robot in the Wild

A service robot is expected to provide proactive service to visitors who need its help. In contrast to passive service, e.g., responding only after being spoken to, proactive service initiates an interaction at an early stage, e.g., addressing potential visitors who need the robot's help in advance. This paper addresses how to anticipate the start of user interaction. We propose an approach that uses only a single RGB camera to anticipate whether a visitor will approach the robot for interaction or simply pass it by. In the proposed approach, we (i) utilize the visitor's pose information, including facial information, extracted from the captured images, (ii) train a CNN-LSTM-based model end to end with an exponential loss for early anticipation, and (iii) during training, teach the network branch for facial keypoints, acquired as part of the human pose information, to mimic a branch trained on face images from a specialized face detector with human verification. By virtue of (iii), at inference time we can run our model on an embedded system, processing only the pose information, without an additional face detector and without the typical accuracy drop. We evaluated the proposed approach on real-world data collected with a real service robot and on the publicly available JPL interaction dataset, and found that it achieves accurate anticipation performance.
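The two training ingredients above can be sketched concretely. The snippet below is a minimal illustrative sketch, not the paper's exact formulation: it assumes one common form of the exponential early-anticipation loss, in which the per-frame cross-entropy is weighted by a factor that grows toward 1 as the sequence nears its end (so an early mistake is penalized mildly while a late one is penalized fully), and an L2 feature-mimicking loss through which the pose-keypoint branch learns to match the face-image branch. All function names and shapes here are assumptions for illustration.

```python
import numpy as np

def early_anticipation_loss(frame_probs, label):
    """Exponentially weighted per-frame cross-entropy for early anticipation.

    frame_probs: (T, C) array of predicted class probabilities per time step.
    label: ground-truth class index for the whole sequence.

    The weight exp(-(T - t)/T) is small for early frames and approaches 1
    at the final frame, so confident early predictions are encouraged
    while early errors are only mildly penalized. (Illustrative form;
    the paper's exact weighting may differ.)
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    T = frame_probs.shape[0]
    t = np.arange(1, T + 1)
    w = np.exp(-(T - t) / T)                    # grows toward 1 at frame T
    ce = -np.log(frame_probs[:, label] + 1e-12) # per-frame cross-entropy
    return float(np.mean(w * ce))

def mimic_loss(pose_branch_feat, face_branch_feat):
    """L2 'mimicking' loss: trains the pose-keypoint branch to reproduce
    the features of the face-image branch, so the face detector can be
    dropped at inference time. Feature vectors are hypothetical."""
    d = np.asarray(pose_branch_feat) - np.asarray(face_branch_feat)
    return float(np.mean(d ** 2))
```

With this weighting, a misclassification at the first frame costs less than the same misclassification at the last frame, which is exactly the asymmetry an early-anticipation loss is meant to create.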
