RAVEL: an annotated corpus for training robots with audiovisual abilities

We introduce RAVEL (Robots with Audiovisual Abilities), a publicly available dataset covering examples of Human-Robot Interaction (HRI) scenarios. These scenarios were recorded with the audio-visual robot head POPEYE, which is equipped with two cameras and four microphones, two of which are plugged into the ears of a dummy head. All recordings were made in an ordinary room with no special acoustic or lighting equipment, providing a challenging indoor scenario. The dataset offers a basis for testing and benchmarking methods and algorithms for audio-visual scene analysis, with the ultimate goal of enabling robots to interact with people in a natural way. The data acquisition setup, sensor calibration, data annotation, and data content are fully detailed. Moreover, three examples of using the recorded data are provided, illustrating its suitability for a wide variety of HRI experiments. The RAVEL data are publicly available at: http://ravel.humavips.eu/.