Social interactions: A first-person perspective

This paper presents a method for the detection and recognition of social interactions in a day-long first-person video of a social event, like a trip to an amusement park. The location and orientation of faces are estimated and used to compute the line of sight for each face. The context provided by all the faces in a frame is used to convert the lines of sight into locations in space to which individuals attend. Further, individuals are assigned roles based on their patterns of attention. The roles and locations of individuals are analyzed over time to detect and recognize the types of social interactions. In addition to patterns of face locations and attention, the head movements of the first person (the camera wearer) can provide additional useful cues about their attentional focus. We demonstrate encouraging results on detection and recognition of social interactions in first-person videos captured over multiple days of experience in amusement parks.
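To make one step of the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of converting per-face lines of sight into a shared location of attention. It assumes faces have already been mapped to 2D ground-plane positions with an estimated gaze angle, and takes the attended location to be the least-squares intersection of the lines of sight; the function name and data layout are illustrative only.

```python
import numpy as np

def attended_location(faces):
    """faces: iterable of (x, y, theta), with theta the gaze angle in radians.
    Returns the 2D point minimizing the summed squared perpendicular
    distance to all lines of sight."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for x, y, theta in faces:
        p = np.array([x, y])
        d = np.array([np.cos(theta), np.sin(theta)])  # unit gaze direction
        # Projector onto the subspace orthogonal to this line of sight.
        P = np.eye(2) - np.outer(d, d)
        A += P
        b += P @ p
    # Solve A q = b; lstsq also handles the degenerate case of parallel gazes.
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q

# Example: two people facing each other and a third looking at their midpoint.
faces = [(0.0, 0.0, 0.0), (4.0, 0.0, np.pi), (2.0, -3.0, np.pi / 2)]
print(attended_location(faces))  # approximately [2.0, 0.0]
```

In the paper's setting, such attended locations, together with the assigned roles, would then feed a temporal model over the video; the sketch above only illustrates the geometric aggregation of gaze cues within a single frame.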
