Multidomain Voice Activity Detection during Human-Robot Interaction

The growing number of social robots is quickly leading to the cohabitation of humans and social robots in homes. These robots interact mainly through verbal communication and are usually endowed with microphones to capture the voice of the people they interact with. However, because of the principle microphones operate on, they also pick up all kinds of non-verbal signals. It is therefore crucial to determine whether the received signal is voice or not. In this work, we present a Voice Activity Detection (VAD) system to address this problem. The audio signal captured by the robot is analyzed online, and several characteristics, or statistics, are extracted. These statistics belong to three different domains: time, frequency, and time-frequency. Their combination results in a robust VAD system that, using the microphones located on the robot, is able to detect when a person starts and stops talking. Finally, several experiments are conducted to test the performance of the system; they show a high success rate in classifying different audio signals as voice or non-voice.
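The abstract does not specify which statistics are extracted in each domain, so the following is only an illustrative sketch of the general approach: frame the audio stream, compute one or two statistics per domain (here short-time energy and zero-crossing rate for the time domain, spectral centroid for the frequency domain, and spectral flux between consecutive frames as a simple time-frequency measure), and combine them into a per-frame voice/non-voice decision. All feature choices and thresholds below are assumptions for illustration, not the authors' actual system.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames (rows)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def vad_features(frames, sr=16000):
    """Per-frame statistics drawn from three domains."""
    # Time domain: short-time energy and zero-crossing rate.
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Frequency domain: spectral centroid of each frame's magnitude spectrum.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)
    # Time-frequency domain: spectral flux between consecutive frames.
    flux = np.r_[0.0, np.sqrt((np.diff(spec, axis=0) ** 2).sum(axis=1))]
    return energy, zcr, centroid, flux

def vad_decision(x, sr=16000, energy_ratio=4.0):
    """Flag frames whose energy exceeds a multiple of an estimated noise
    floor and whose spectral centroid lies in a speech-like band.
    The thresholds (ratio of 4, 100-4000 Hz band) are illustrative."""
    frames = frame_signal(x)
    energy, zcr, centroid, flux = vad_features(frames, sr)
    noise_floor = np.percentile(energy, 10) + 1e-12
    return (energy > energy_ratio * noise_floor) \
        & (centroid > 100) & (centroid < 4000)
```

Combining several weak statistics like this is what makes the decision robust: a single energy threshold would fire on any loud noise, while the frequency- and time-frequency-domain statistics help reject non-speech sounds.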