A Hierarchical Approach for Associating Body-Worn Sensors to Video Regions in Crowded Mingling Scenarios

We address the complex problem of associating several wearable devices with the spatio-temporal regions of their wearers in video during crowded mingling events, using only acceleration and proximity. This is an important first step for multisensor behavior analysis with video and wearable technologies in settings where the privacy of the participants must be maintained. Most state-of-the-art works that use these two modalities perform the association manually, which becomes practically infeasible as the number of people in the scene increases. We propose an automatic association method based on a hierarchical linear assignment optimization that exploits the spatial context of the scene. We present extensive experiments on matching from 2 to more than 69 acceleration and video streams, showing significant improvements over a random baseline in a real-world crowded mingling scenario. We also show the effectiveness of our method for incomplete or missing streams (up to a certain limit) and analyze the tradeoff between the length of the streams and the number of participants. Finally, we provide an analysis of failure cases, showing that a deep understanding of the social actions within the context of the event is necessary to further improve performance on this task.
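To make the linear assignment idea concrete, the following is a minimal sketch of a single, flat assignment step, not the hierarchical method itself. It assumes that each wearable and each video region has already been reduced to a one-dimensional motion signal of equal length, and that the cost of a pairing is one minus the Pearson correlation of the two signals. The function names, stream identifiers, and toy signals are hypothetical, and the brute-force search over permutations stands in for a proper assignment solver; it is only practical for a handful of streams and needs at least as many video tracks as devices.

```python
from itertools import permutations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def assign(wearable, video):
    """Return the lowest-cost one-to-one matching as {wearable_id: video_id}.

    Assumed cost: 1 - correlation between the two motion signals.
    Exhaustive search, so the running time grows factorially.
    """
    w_ids = sorted(wearable)
    v_ids = sorted(video)
    cost = {
        (w, v): 1.0 - pearson(wearable[w], video[v])
        for w in w_ids
        for v in v_ids
    }
    best = min(
        permutations(v_ids, len(w_ids)),
        key=lambda p: sum(cost[w, v] for w, v in zip(w_ids, p)),
    )
    return dict(zip(w_ids, best))


# Hypothetical toy signals: acceleration magnitude per device,
# and motion magnitude per tracked video region.
wearable = {
    "dev_a": [0, 1, 0, 1, 0, 1],
    "dev_b": [0, 0, 1, 1, 0, 0],
    "dev_c": [1, 2, 3, 4, 5, 6],
}
video = {
    "track_1": [1, 2, 3, 4, 5, 7],
    "track_2": [0, 2, 0, 2, 0, 2],
    "track_3": [0, 0, 2, 2, 0, 1],
}

matching = assign(wearable, video)
print(matching)
```

A hierarchical variant would presumably first restrict the candidate pairings using the spatial context (for example, proximity information) and then solve smaller assignment problems of this kind; that step is not shown here because the abstract does not specify it.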
