Multiparty Interaction Understanding Using Smart Multimodal Digital Signage

This paper presents a novel multimodal system designed for the analysis of multi-party human-human interaction. Designing human-machine interfaces for multiple users is challenging because the actions and reactions of several people must be sensed and interpreted simultaneously and consistently. The proposed system consists of a large display equipped with multiple sensing devices: a microphone array, HD video cameras, and depth sensors. Multiple users positioned in front of the panel interact freely using voice or gesture while looking at the displayed content, without wearing any special equipment (such as motion-capture sensors or head-mounted devices). Acoustic and visual information is captured and processed jointly, using established and state-of-the-art techniques, to obtain each user's speech activity and gaze direction. Furthermore, a new framework is proposed to model audio-visual (A/V) multimodal interaction between verbal and nonverbal communication events. The dynamics of speech activity obtained from speaker diarization and of head poses extracted from the video images are modeled using hybrid dynamical systems (HDS). We show that the temporal structure captured by the HDS can be used to estimate the level of multimodal interaction, feedback that can help improve the multi-party communication experience. Experimental results on synthetic data and on real-world group-communication datasets, such as poster presentations, demonstrate the feasibility of the proposed system.
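
To make the interaction-level idea concrete, the following minimal Python sketch illustrates the kind of input the framework operates on: per-frame speech-activity labels (from diarization) and gaze targets (from head pose), viewed as temporal event streams. The names FrameState and interaction_level are hypothetical, and the simple speech/gaze co-occurrence statistic computed here is only a crude stand-in for the HDS-based temporal-structure modeling described in the paper, not the authors' actual method.

    # Hypothetical sketch: a speech/gaze co-occurrence score as a toy
    # proxy for multimodal interaction level. The real system fits
    # hybrid dynamical systems to these event intervals instead.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class FrameState:
        speaking: List[bool]               # speaking[i]: is participant i talking?
        gaze_target: List[Optional[int]]   # gaze_target[i]: whom participant i looks at

    def interaction_level(frames: List[FrameState]) -> float:
        """Fraction of speaking frames in which some listener gazes at the speaker."""
        aligned = total = 0
        for f in frames:
            for spk, talking in enumerate(f.speaking):
                if not talking:
                    continue
                total += 1
                # count the frame as "aligned" if any other participant
                # is looking at the current speaker
                if any(tgt == spk for lis, tgt in enumerate(f.gaze_target) if lis != spk):
                    aligned += 1
        return aligned / total if total else 0.0

    # Toy example: participant 0 speaks in both frames, but participant 1
    # looks at them only in the first, so the score is 0.5.
    frames = [
        FrameState(speaking=[True, False], gaze_target=[None, 0]),
        FrameState(speaking=[True, False], gaze_target=[None, None]),
    ]
    print(interaction_level(frames))  # 0.5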
