Temporal encoded F-formation system for social interaction detection

In the context of a social gathering, such as a cocktail party, memorable moments are typically captured either by professional photographers or by the participants themselves. The latter is often undesirable, because many participants would rather enjoy the event than be occupied with taking photos. Motivated by this scenario, we propose using a set of cameras to take photos automatically. Instead of performing dense analysis on every camera stream, we first detect the occurrence and location of social interactions via F-formation detection. In the sociology literature, an F-formation is a spatial arrangement that characterizes a social interaction; detecting one requires only the spatial location and body orientation of each participant, information that can be robustly obtained with additional Kinect depth sensors. In this paper, we propose an extended F-formation system for robust detection of interactions and interactants. The extended F-formation system employs a heat-map based feature representation for each individual, namely the Interaction Space (IS), which models location, orientation, and temporal information. Using the temporally encoded IS of each detected interactant, we further propose a best-view camera selection framework that identifies the best camera view for each detected social interaction. The extended F-formation system is evaluated on synthetic data covering multiple scenarios. To demonstrate the effectiveness of the proposed system, we conducted a user study on real-world data comparing our best-view camera ranking with rankings produced by human participants.
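
To illustrate how a heat-map style Interaction Space could be accumulated from per-person location and orientation, the sketch below builds a per-person IS on a discretised floor grid and blends successive frames with an exponential decay. This is a minimal illustration, not the authors' implementation: the grid resolution, frontal cone angle, interaction reach, and decay factor are assumed parameters chosen only for the example.

```python
# Minimal sketch (assumed parameters, not the paper's implementation) of a
# heat-map style Interaction Space (IS) accumulated per person per frame.
import numpy as np

GRID_W, GRID_H = 100, 100          # discretised room; cell size is an assumption
CONE_HALF_ANGLE = np.radians(60)   # frontal sector a person attends to (assumed)
REACH = 20                         # interaction reach in cells (assumed)
DECAY = 0.9                        # per-frame temporal decay factor (assumed)

def person_is(x, y, theta):
    """Heat map of one person's interaction space for a single frame."""
    ys, xs = np.mgrid[0:GRID_H, 0:GRID_W]
    dx, dy = xs - x, ys - y
    dist = np.hypot(dx, dy)
    # angular offset between the person's facing direction and each cell
    ang = np.abs(np.arctan2(dy, dx) - theta)
    ang = np.minimum(ang, 2 * np.pi - ang)
    in_cone = (dist <= REACH) & (ang <= CONE_HALF_ANGLE)
    # weight decays with distance inside the frontal cone, zero elsewhere
    return np.where(in_cone, np.exp(-dist / REACH), 0.0)

def update_temporal_is(prev_map, x, y, theta):
    """Blend the current frame's IS into a temporally decayed accumulator."""
    return DECAY * prev_map + (1 - DECAY) * person_is(x, y, theta)

# Example: two people facing each other. Their accumulated IS maps overlap,
# and the overlap peak hints at a shared o-space (candidate F-formation).
a = update_temporal_is(np.zeros((GRID_H, GRID_W)), 40, 50, 0.0)
b = update_temporal_is(np.zeros((GRID_H, GRID_W)), 60, 50, np.pi)
overlap = a * b
peak = np.unravel_index(np.argmax(overlap), overlap.shape)
print("candidate o-space centre (row, col):", peak)
```

In this sketch the temporal decay plays the role of the temporal encoding described above: interactions that persist over several frames accumulate a stronger overlap than momentary alignments, which is one plausible way to suppress spurious detections.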
