Identifying First-Person Camera Wearers in Third-Person Videos

We consider scenarios in which multiple people wear body-worn cameras while a static third-person camera also captures the scene, and in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks. To do this, we need to establish person-level correspondences across first- and third-person videos, which is challenging because the camera wearer is not visible in their own egocentric video, preventing the use of direct feature matching. In this paper, we propose a new semi-Siamese Convolutional Neural Network architecture to address this novel challenge. We formulate the problem as learning a joint embedding space for first- and third-person videos that considers both spatial- and motion-domain cues. A new triplet loss function is designed to minimize the distance between correct first- and third-person matches while maximizing the distance between incorrect ones. This end-to-end approach performs significantly better than several baselines, in part because it learns first- and third-person features optimized for matching jointly with the distance measure itself.
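To make the matching objective concrete, the following is a minimal sketch of a standard hinge-style triplet loss over embedding vectors, together with nearest-neighbor matching in the joint embedding space. It is not the paper's loss function, which the abstract describes only as new. The function names (`euclidean`, `triplet_loss`, `match_wearer`), the Euclidean distance, and the `margin` value are all assumptions made for illustration; in the proposed approach the distance measure itself is learned jointly with the features.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss (an assumption, not the paper's loss).

    anchor:   embedding of a first-person (egocentric) video
    positive: embedding of the correct person in the third-person video
    negative: embedding of an incorrect person in the third-person video
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def match_wearer(ego_embedding, candidate_embeddings):
    """Return the index of the third-person candidate closest to the
    egocentric embedding in the (hypothetical) joint embedding space."""
    distances = [euclidean(ego_embedding, c) for c in candidate_embeddings]
    return min(range(len(distances)), key=distances.__getitem__)
```

At test time, under these assumptions, each egocentric video would be embedded once and assigned to the visible person whose third-person embedding lies nearest, for example `match_wearer(ego, [person_a, person_b, person_c])`.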
