Customizing First Person Image Through Desired Actions

This paper studies the problem of inverse visual path planning: creating a visual scene from a first person action. Our conjecture is that the spatial arrangement of a first person visual scene is laid out to afford an action, and therefore the action can be used inversely to synthesize a new scene in which that action is feasible. As a proof of concept, we focus on linking visual experiences induced by walking. A key innovation of this paper is the concept of an ActionTunnel: a 3D virtual tunnel along the future trajectory that encodes what the wearer will visually experience while moving into the scene. The ActionTunnel connects two distinct first person images through similar walking paths. Our method takes a first person image with a user-defined future trajectory and outputs a new image that can afford the future motion. The image is created by combining the present and future ActionTunnels in 3D, where the missing pixels in the adjoining area are computed by a generative adversarial network. Our work enables travel across different first person experiences in diverse real-world scenes.
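As a rough geometric illustration of a tunnel along a walking trajectory, the sketch below places a rectangular cross-section at each point of a ground-plane path and projects its corners into the image with a pinhole camera. This is not the paper's implementation. The function names, the tunnel width and height, the camera height, and the intrinsics are all hypothetical values chosen for illustration, and the cross-sections are kept perpendicular to the optical axis rather than to the path tangent for simplicity.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def tunnel_cross_sections(
    path_xz: List[Tuple[float, float]],
    width: float = 1.0,        # assumed tunnel width in metres
    height: float = 2.0,       # assumed tunnel height in metres
    cam_height: float = 1.6,   # assumed camera height above the ground
) -> List[List[Point3]]:
    """Return one rectangular cross-section (4 corners) per path point.

    Camera coordinates: x right, y down, z forward.
    The ground plane therefore lies at y = cam_height.
    """
    sections = []
    for x, z in path_xz:
        left, right = x - width / 2.0, x + width / 2.0
        bottom = cam_height           # on the ground
        top = cam_height - height     # y points down, so "up" is smaller y
        sections.append([
            (left, bottom, z), (right, bottom, z),
            (right, top, z), (left, top, z),
        ])
    return sections


def project(point: Point3, focal: float = 500.0,
            cx: float = 320.0, cy: float = 240.0) -> Point2:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def project_tunnel(path_xz: List[Tuple[float, float]]) -> List[List[Point2]]:
    """Project every cross-section corner of the tunnel into the image."""
    return [[project(c) for c in sec] for sec in tunnel_cross_sections(path_xz)]


if __name__ == "__main__":
    # A hypothetical straight path 2 m to 6 m ahead of the wearer.
    path = [(0.0, 2.0), (0.0, 4.0), (0.0, 6.0)]
    for corners in project_tunnel(path):
        print(corners)
```

Under these assumptions, cross-sections farther along the path project to smaller rectangles, which gives the nested, tunnel-like appearance in the image. A user-defined trajectory would simply replace the straight path above.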
