A Nonparametric Model for Multimodal Collaborative Activities Summarization

Egocentric data streams provide a unique opportunity to reason about joint behavior by pooling data across individuals. This is especially valuable in urban environments, which teem with human activity but yield incomplete and noisy data. Collaborative human activities share common spatial, temporal, and visual characteristics, which facilitate inference across individuals from multiple sensory modalities; we explore this in the context of meetings. We propose a new Bayesian nonparametric model that efficiently pools video and GPS data from multiple individuals for the analysis of collaborative activities. We demonstrate the utility of this model for inference tasks such as activity detection, classification, and summarization. We further demonstrate how the spatio-temporal structure embedded in our model enables better understanding, based on social interactions, of partial and noisy observations such as localization and face detections. We show results on both synthetic experiments and a new dataset of egocentric video and noisy GPS data from multiple individuals.
