Far-Field Audio-Visual Scene Perception of Multi-Party Human-Robot Interaction for Children and Adults

Human-robot interaction (HRI) is a research area of growing interest, with numerous applications for both child and adult user groups, for example in edutainment and social robotics. Crucial to its wider adoption, however, remains robust perception of HRI scenes in natural, untethered, multi-party interaction scenarios across user groups. Towards this goal, we investigate three focal HRI perception modules operating on data from multiple audio-visual sensors that observe the HRI scene from the far field, thus bypassing the limitations and platform dependency of contemporary robotic sensing. In particular, the developed modules fuse intra- and/or inter-modality data streams to perform: (i) audio-visual speaker localization; (ii) distant speech recognition; and (iii) visual recognition of hand gestures. Emphasis is also placed on ensuring high speech and gesture recognition rates for both children and adults. Development and objective evaluation of the three modules are conducted on a corpus covering both user groups, collected with our far-field multisensory setup, for an interaction scenario of a question-answering "guess-the-object" collaborative HRI game with a "Furhat" robot. In addition, an evaluation of the full game, incorporating the three developed modules, is reported. Our results demonstrate robust far-field audio-visual perception of the multi-party HRI scene.
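For illustration, the sketch below outlines one common approach to the kind of audio-visual speaker localization described above: a steered-response-power map with phase-transform weighting (SRP-PHAT) computed over a grid of candidate positions from a far-field microphone array, fused at the score level with a visual likelihood map (e.g., from face detection). This is a minimal sketch under assumed interfaces, not the paper's implementation; the function names, the convex-combination fusion, and the weight `w_audio` are illustrative assumptions.

```python
import numpy as np

def srp_phat_map(frames, mic_positions, grid, fs, c=343.0):
    """SRP-PHAT acoustic map over candidate source positions.

    frames:        (num_mics, num_samples) time-aligned microphone signals
    mic_positions: (num_mics, 3) microphone coordinates in meters
    grid:          (num_points, 3) candidate source positions in meters
    Returns a (num_points,) map normalized to [0, 1].
    """
    spectra = np.fft.rfft(frames, axis=1)
    phat = spectra / (np.abs(spectra) + 1e-12)            # PHAT: keep phase only
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)  # (num_bins,)
    power = np.empty(len(grid))
    for g, point in enumerate(grid):
        # Propagation delay from the candidate point to each microphone
        delays = np.linalg.norm(mic_positions - point, axis=1) / c
        # Steering phases that re-align the array onto this point
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        # Steered response power: coherent sum over mics, energy over frequency
        power[g] = np.sum(np.abs(np.sum(phat * steering, axis=0)) ** 2)
    return power / power.max()

def fuse_audio_visual(acoustic_map, visual_map, w_audio=0.6):
    """Late (score-level) fusion: convex combination of normalized maps."""
    return w_audio * acoustic_map + (1.0 - w_audio) * visual_map

# Usage sketch: the fused map's argmax gives the estimated speaker position.
# speaker_xyz = grid[np.argmax(fuse_audio_visual(a_map, v_map))]
```

The late-fusion weight trades off acoustic and visual evidence; in a far-field setup, the visual stream typically helps most when reverberation or overlapping speech degrades the acoustic map.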
