Fourth-Person Captioning: Describing Daily Events by Uni-supervised and Tri-regularized Training

We aim to develop a support system that enhances humans' short-term visual memory in an intelligent space where a human and a service robot coexist. In particular, this paper focuses on how diverse and complex life events can be interpreted and recorded on behalf of humans from a multi-perspective viewpoint. We propose a novel method named "fourth-person captioning," which generates natural-language descriptions by complementarily summarizing the visual contexts captured by three types of cameras corresponding to the first-, second-, and third-person viewpoints. We first extend a state-of-the-art image captioning technique and design a new model that generates a sequence of words given multiple images. We then provide an effective training strategy that requires only caption annotations for images from a single viewpoint in a general captioning dataset, together with unsupervised triplet instances collected in the intelligent space. As the three types of cameras, we select a wearable camera on the human, a robot-mounted camera, and a camera embedded in the environment, which correspond to the first-, second-, and third-person viewpoints, respectively. We hope our work will accelerate cross-modal interaction that bridges humans' egocentric cognition and multi-perspective intelligence.
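The abstract specifies the model and the "uni-supervised, tri-regularized" objective only at a high level. The following is a minimal PyTorch sketch of one plausible realization, not the paper's actual design: a CNN encoder shared across the three viewpoints, a linear fusion layer feeding an LSTM decoder, a supervised captioning loss computed by copying a single labeled viewpoint image into all three slots, and a squared-distance regularizer over unlabeled triplets. The names ViewEncoder, FourthPersonCaptioner, and tri_regularizer, the fusion scheme, and the loss weight are all illustrative assumptions.

    import torch
    import torch.nn as nn

    class ViewEncoder(nn.Module):
        """Small stand-in CNN; a real system would use a stronger backbone."""
        def __init__(self, dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, dim),
            )
        def forward(self, x):
            return self.net(x)

    class FourthPersonCaptioner(nn.Module):
        """Shared encoder over three viewpoints, linear fusion, LSTM decoder."""
        def __init__(self, vocab_size, dim=256):
            super().__init__()
            self.encoder = ViewEncoder(dim)          # shared across viewpoints
            self.fuse = nn.Linear(3 * dim, dim)      # fuse 1st/2nd/3rd-person features
            self.embed = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)

        def encode_views(self, v1, v2, v3):
            return self.encoder(v1), self.encoder(v2), self.encoder(v3)

        def forward(self, v1, v2, v3, captions):
            f1, f2, f3 = self.encode_views(v1, v2, v3)
            ctx = torch.tanh(self.fuse(torch.cat([f1, f2, f3], dim=1)))
            # Prepend the fused visual context, then teacher-force the caption tokens.
            inputs = torch.cat([ctx.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                  # (batch, seq_len, vocab_size)

    def tri_regularizer(f1, f2, f3):
        """Pull the three viewpoint features of the same scene toward each other."""
        return ((f1 - f2).pow(2).mean() + (f2 - f3).pow(2).mean()
                + (f3 - f1).pow(2).mean())

    model = FourthPersonCaptioner(vocab_size=10000)
    ce = nn.CrossEntropyLoss()

    # Supervised branch: a captioned single-view image (e.g., from a general
    # caption dataset) fed into all three slots, so only one viewpoint needs labels.
    img = torch.randn(2, 3, 64, 64)
    caps = torch.randint(0, 10000, (2, 12))
    logits = model(img, img, img, caps)
    loss_sup = ce(logits.reshape(-1, logits.size(-1)), caps.reshape(-1))

    # Unsupervised branch: an unlabeled first/second/third-person triplet recorded
    # in the intelligent space regularizes the shared embedding.
    v1, v2, v3 = (torch.randn(2, 3, 64, 64) for _ in range(3))
    loss_reg = tri_regularizer(*model.encode_views(v1, v2, v3))
    loss = loss_sup + 0.1 * loss_reg                 # weight is an assumption
    loss.backward()

Sharing one encoder keeps the labeled single-view images and the unlabeled in-space triplets in a common embedding space, which is presumably what lets caption supervision learned on one viewpoint transfer to the fused multi-view input.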
