Multi-party Human-Machine Interaction Using a Smart Multimodal Digital Signage

In this paper, we present a novel multimodal system designed for smooth multi-party human-machine interaction. HCI with multiple users is challenging because simultaneous actions and reactions must be handled consistently. The proposed system consists of a digital signage, or large display, equipped with multiple sensing devices: a 19-channel microphone array, six HD video cameras (three placed on the top and three on the bottom of the display), and two depth sensors. The display can show various contents, similar to a poster presentation, or multiple windows (e.g., web browsers and photos). Multiple users positioned in front of the panel can freely interact using voice or gesture while looking at the displayed contents, without wearing any special device (such as motion-capture sensors or head-mounted displays). Acoustic and visual information are processed jointly using state-of-the-art techniques to obtain each user's speech and gaze direction, so that the displayed contents can be adapted to the users' interests.
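
To illustrate the kind of audio-visual association the abstract describes, the following is a minimal, hypothetical sketch (not the authors' implementation): it assumes the vision pipeline yields a head direction and gaze target per tracked user, and the microphone-array pipeline yields a direction of arrival per separated speech segment; all class and function names are invented for illustration.

```python
# Hypothetical sketch: attribute a separated speech segment to a tracked user
# by matching the sound-source direction of arrival (DOA) against each user's
# head direction, then adapt the display to that user's gaze target.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrackedUser:
    user_id: int
    azimuth_deg: float   # horizontal head direction w.r.t. the display center
    gaze_target: str     # identifier of the content window the user looks at


@dataclass
class SpeechSegment:
    text: str            # ASR hypothesis for one separated source
    doa_deg: float       # direction of arrival estimated from the mic array


def attribute_speech(segment: SpeechSegment,
                     users: List[TrackedUser],
                     max_angle_deg: float = 15.0) -> Optional[TrackedUser]:
    """Return the tracked user whose head direction best matches the
    segment's DOA, if the angular difference is within a tolerance."""
    best, best_diff = None, max_angle_deg
    for user in users:
        diff = abs(user.azimuth_deg - segment.doa_deg)
        if diff <= best_diff:
            best, best_diff = user, diff
    return best


def adapt_display(speaker: Optional[TrackedUser]) -> str:
    """Toy content-adaptation rule: highlight the window the speaker is
    currently looking at while he or she talks."""
    return "no-op" if speaker is None else f"highlight:{speaker.gaze_target}"


if __name__ == "__main__":
    users = [TrackedUser(0, -20.0, "poster_section_2"),
             TrackedUser(1, 25.0, "web_browser")]
    seg = SpeechSegment("can you zoom in here?", doa_deg=22.0)
    print(adapt_display(attribute_speech(seg, users)))  # -> highlight:web_browser
```

The angular-matching threshold and the highlight action are placeholders; in practice the speaker attribution and content adaptation would rely on the jointly processed acoustic and visual cues described above.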
