Improved Optical Flow for Gesture-based Human-robot Interaction

Gesture interaction is a natural way of communicating with a robot as an alternative to speech. Gesture recognition methods leverage optical flow in order to understand human motion. However, while accurate optical flow estimation (i.e., traditional) methods are costly in terms of runtime, fast estimation (i.e., deep learning) methods’ accuracy can be improved. In this paper, we present a pipeline for gesture-based human-robot interaction that uses a novel optical flow estimation method in order to achieve an improved speed-accuracy trade-off. Our optical flow estimation method introduces four improvements to previous deep learning-based methods: strong feature extractors, attention to contours, midway features, and a combination of these three. This results in a better understanding of motion, and a finer representation of silhouettes. In order to evaluate our pipeline, we generated our own dataset, MIBURI, which contains gestures to command a house service robot. In our experiments, we show how our method improves not only optical flow estimation, but also gesture recognition, offering a speed-accuracy trade-off more realistic for practical robot applications.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Karon E. MacLean,et al.  Gestures for industry Intuitive human-robot communication from human observation , 2013, 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[4]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[5]  Xi Li,et al.  Deep Optical Flow Estimation Via Multi-Scale Correspondence Structure Learning , 2017, IJCAI.

[6]  Michael J. Black,et al.  Optical Flow Estimation Using a Spatial Pyramid Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Christoph Bartneck,et al.  You just do not understand me! Speech Recognition in Human Robot Interaction , 2014, The 23rd IEEE International Symposium on Robot and Human Interactive Communication.

[9]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[11]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[12]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Isabelle Guyon,et al.  The ChaLearn gesture dataset (CGD 2011) , 2014, Machine Vision and Applications.

[15]  S. Goldin-Meadow,et al.  The role of gesture in communication and thinking , 1999, Trends in Cognitive Sciences.

[16]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[17]  Michael J. Black,et al.  A Naturalistic Open Source Movie for Optical Flow Evaluation , 2012, ECCV.

[18]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[19]  Yi Zhu,et al.  DenseNet for dense flow , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[20]  Ana Paiva,et al.  Social Robots for Long-Term Interaction: A Survey , 2013, International Journal of Social Robotics.

[21]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Britta Wrede,et al.  Domestic applications for social robots: an online survey on the influence of appearance and capabilities , 2008 .

[23]  Jagdish Lal Raheja,et al.  Real-Time Robotic Hand Control Using Hand Gestures , 2010, 2010 Second International Conference on Machine Learning and Computing.

[24]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[25]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[26]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[27]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Ali Sekmen,et al.  Human–robot interaction via voice-controllable intelligent user interface , 2007, Robotica.

[29]  Stefan Kopp,et al.  Generation and Evaluation of Communicative Robot Gesture , 2012, Int. J. Soc. Robotics.

[30]  Paulo Menezes,et al.  Face tracking and hand gesture recognition for human-robot interaction , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[31]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).