Assessing the effect of visual servoing on the performance of linear microphone arrays in moving human-robot interaction scenarios

Abstract: Social robotics is becoming a reality, and voice-based human-robot interaction (HRI) is essential for a successful human-robot collaborative symbiosis. The main objective of this paper is to assess the effect of visual servoing on the performance of a linear microphone array in distant automatic speech recognition (ASR), using a mobile, dynamic and non-stationary robotic testbed that can be representative of real HRI scenarios. Visual servoing and image target tracking are different tasks, and this paper focuses on an effect that is rarely addressed in the literature: the dependence of the beamforming directivity on look direction. The datasets required for this study did not exist and had to be generated, so a state-of-the-art mobile robotic testbed was set up with target speech and noise sources. A linear microphone array was chosen as a case study and its response was measured. Standard beamforming methods were evaluated with respect to visual servoing: delay-and-sum combined with image tracking, weighted delay-and-sum, and minimum variance distortionless response (MVDR), also combined with image tracking. The results show that the performance of beamforming methods is dramatically degraded in moving and non-stationary conditions. In this context, visual servoing in HRI can significantly improve the ASR accuracy obtained with a linear microphone array. The average reduction in word error rate (WER) achieved when the robot head was steered toward the target speech source was as high as 28.2%. Finally, the methodology adopted here is applicable to any microphone array, linear or not.
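
The dependence of beamforming directivity on look direction can be illustrated with a minimal sketch. The code below is not the measurement procedure used in the paper: it assumes an idealized uniform linear array of omnidirectional microphones in free field, a spherically isotropic (diffuse) noise field, and textbook delay-and-sum weights. The function name `das_directivity_factor`, the microphone count, the spacing and the frequency are all hypothetical values chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def das_directivity_factor(num_mics, spacing_m, freq_hz, look_deg):
    """Directivity factor of a delay-and-sum beamformer on a uniform linear array.

    look_deg is measured from the array axis (0 = endfire, 90 = broadside).
    Assumes ideal omnidirectional microphones and diffuse noise.
    """
    positions = np.arange(num_mics) * spacing_m
    look = np.deg2rad(look_deg)

    # Far-field steering vector toward the look direction.
    steering = np.exp(
        -2j * np.pi * freq_hz * positions * np.cos(look) / SPEED_OF_SOUND
    )

    # Delay-and-sum weights: unit gain toward the look direction.
    weights = steering / num_mics

    # Diffuse-field coherence between microphone pairs:
    # sin(2*pi*f*d/c) / (2*pi*f*d/c), written with numpy's normalized sinc.
    distances = np.abs(positions[:, None] - positions[None, :])
    coherence = np.sinc(2.0 * freq_hz * distances / SPEED_OF_SOUND)

    # Gain toward the look direction is 1, so the directivity factor is
    # the inverse of the diffuse-noise power at the beamformer output.
    noise_gain = np.real(np.vdot(weights, coherence @ weights))
    return 1.0 / noise_gain


if __name__ == "__main__":
    # Hypothetical array: 4 microphones, 5 cm apart, evaluated at 1 kHz.
    for look_deg in (0, 30, 60, 90):
        factor = das_directivity_factor(4, 0.05, 1000.0, look_deg)
        print(f"look direction {look_deg:3d} deg: "
              f"directivity index {10 * np.log10(factor):.2f} dB")
```

Under these assumptions the directivity index changes as the beam is steered electronically between endfire and broadside. Steering the array physically toward the source, as visual servoing does, keeps the look direction fixed relative to the array, so the beamformer always operates at the same point on this curve.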
