Learning Social Relations from Videos: Features, Models, and Analytics

Despite recent progress in video understanding, extracting relations among the actors in a video remains largely unexplored. In this chapter, we review one of the first studies on learning such relations from videos using visual and auditory cues. The main contribution is a machine learning framework that associates low-level video features with social relations. Specifically, support vector regression is used to estimate local grouping cues from low-level visual and auditory features. These locally defined grouping cues are then combined to derive the affinity between actors. Finally, the social network defined by the resulting affinity is analyzed to find communities of actors and to identify the leader of each community. As an extension to the basic framework, we also discuss the relationship between visual concepts and social relations. We demonstrate the performance of these approaches on a set of videos.
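
To make the last stage of this pipeline concrete, the following is a minimal sketch of how an actor affinity matrix could be analyzed for communities and leaders. It is not the chapter's implementation. The affinity values are invented for illustration (in the chapter they would come from the regression-based grouping cues, which are not reproduced here), and the two techniques used, a two-way split by the leading eigenvector of the modularity matrix and leader selection by eigenvector centrality, are assumed as common choices for these steps. The function names `split_communities` and `leader` are hypothetical.

```python
import numpy as np


def split_communities(affinity):
    """Two-way split by the sign of the leading eigenvector of the
    modularity matrix B = A - k k^T / (2m). Assumed method, for illustration."""
    a = np.asarray(affinity, dtype=float)
    k = a.sum(axis=1)                  # weighted degree of each actor
    two_m = k.sum()                    # total edge weight, counted twice
    b = a - np.outer(k, k) / two_m     # modularity matrix
    vals, vecs = np.linalg.eigh(b)     # b is symmetric, so eigh applies
    lead = vecs[:, np.argmax(vals)]
    return lead >= 0                   # boolean membership mask


def leader(affinity, members):
    """Actor with the highest eigenvector centrality inside one community."""
    sub = np.asarray(affinity, dtype=float)[np.ix_(members, members)]
    vals, vecs = np.linalg.eigh(sub)
    centrality = np.abs(vecs[:, np.argmax(vals)])  # principal eigenvector
    return int(members[int(np.argmax(centrality))])


# Hypothetical symmetric affinity between six actors (made-up values).
affinity = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.3, 0.0, 0.0, 0.0],
    [0.8, 0.3, 0.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.0, 0.9, 0.7],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.2],
    [0.0, 0.0, 0.1, 0.7, 0.2, 0.0],
])

mask = split_communities(affinity)
groups = [np.flatnonzero(mask), np.flatnonzero(~mask)]
leaders = [leader(affinity, g) for g in groups]

for members, lead_actor in zip(groups, leaders):
    print("community:", members.tolist(), "leader:", lead_actor)
```

The sign of an eigenvector is arbitrary, so which community is listed first may vary; only the partition itself is meaningful. A split into more than two communities would need the bisection applied repeatedly or a different clustering method.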