3D vision technology for capturing multimodal corpora: chances and challenges

Data annotation is the most labor-intensive part of acquiring a multimodal corpus. 3D vision technology can ease the annotation process, especially when continuous surface deformations need to be extracted accurately and consistently over time. In this paper, we give an example use of such technology: the acquisition of an audio-visual corpus comprising detailed dynamic face geometry, a transcription of the corpus text into a phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals. Using this example, we discuss the advantages and challenges of integrating non-invasive 3D vision capture techniques into a setup for recording multimodal data.
