Who is where?: Matching People in Video to Wearable Acceleration During Crowded Mingling Events

We address the challenging problem of associating acceleration data from a wearable sensor with the corresponding spatio-temporal region of a person in video during crowded mingling scenarios. This is an important first step for multi-sensor behavior analysis using these two modalities. As the number of people in a scene increases, so does the need to robustly and automatically associate a region of the video with each person's device. We propose a hierarchical association approach that exploits the spatial context of the scene and significantly outperforms state-of-the-art approaches. Moreover, we present experiments matching from 3 to more than 130 acceleration and video streams, which, to our knowledge, is a significantly larger scale than prior work, where only up to 5 device streams were associated.
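At its core, the task can be framed as a bipartite assignment between wearable device streams and tracked video regions. The following is a minimal illustrative sketch, not the paper's hierarchical method: it assumes each person's motion shows up both as accelerometer magnitude and as video motion magnitude, scores device-track pairs by normalized cross-correlation, and solves the assignment by brute force (the Hungarian algorithm would replace this for larger groups). All signal names and parameters here are synthetic assumptions for illustration.

```python
import itertools
import math
import random

random.seed(0)

T, n = 200, 3  # time steps, number of people/devices

# Synthetic per-person motion: each person moves with a distinct rhythm.
true_signals = [[math.sin(0.1 * (k + 1) * t) for t in range(T)]
                for k in range(n)]

# Wearable acceleration magnitude: the person's motion plus sensor noise.
accel = [[s + random.gauss(0, 0.2) for s in sig] for sig in true_signals]

# Video motion magnitude per tracked region: same motion, independent noise.
video = [[s + random.gauss(0, 0.2) for s in sig] for sig in true_signals]

def ncc(a, b):
    """Normalized cross-correlation at zero lag."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den

# Similarity matrix: device i vs. video region j.
sim = [[ncc(accel[i], video[j]) for j in range(n)] for i in range(n)]

# Brute-force optimal assignment (fine for tiny n; use Hungarian otherwise).
best = max(itertools.permutations(range(n)),
           key=lambda p: sum(sim[i][p[i]] for i in range(n)))
print(best)  # each device matched back to its own region: (0, 1, 2)
```

In crowded mingling scenes the similarity matrix becomes far less clean (occlusions, similar motion patterns among conversing neighbors), which is what motivates exploiting spatial context on top of such pairwise correlation scores.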
