MULTIMODAL SPEAKER IDENTITY CONVERSION - CONTINUED

Converting the speech and facial movements of a source speaker into those of another, identified target speaker is a challenging problem. In this paper we build on the experience gained in a previous eNTERFACE workshop to produce a working, although still very imperfect, identity conversion system. The system is based on the late fusion of two independently obtained conversion results: voice conversion and facial movement conversion. In an attempt to convert the glottal source and vocal tract features of speech in parallel, we examine the usability of the ARX-LF source-filter model of speech. Because this model is highly sensitive to parameter modification, we then turn to the codebook-based STASC model. For face conversion, we first build 3D facial models of the source and target speakers, using the MPEG-4 standard. Facial movements are then tracked using the Active Appearance Model approach. Facial movement mapping is obtained by imposing the source Facial Animation Parameters (FAPs) on the 3D model of the target, with the target's Facial Animation Parameter Units (FAPUs) used to interpret the source FAPs.
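The FAP-based mapping described above can be illustrated with a minimal sketch. It is not the system's implementation. It assumes only the MPEG-4 convention that a distance FAPU is a neutral-face distance divided by 1024 and that a displacement equals the FAP value multiplied by its FAPU. The function names, the dictionary keys, and the face measurements are hypothetical and chosen purely for illustration.

```python
# Minimal sketch of FAP retargeting between two faces (illustrative only).
# Assumption: distance FAPUs follow the MPEG-4 convention of
# "neutral-face distance / 1024". All names and numbers are hypothetical.


def fapus_from_neutral(mouth_nose_sep, mouth_width, eye_sep, eye_nose_sep):
    """Derive distance FAPUs from neutral-face measurements (model units)."""
    return {
        "MNS": mouth_nose_sep / 1024.0,  # mouth-nose separation unit
        "MW": mouth_width / 1024.0,      # mouth width unit
        "ES": eye_sep / 1024.0,          # eye separation unit
        "ENS": eye_nose_sep / 1024.0,    # eye-nose separation unit
    }


def displacement_to_fap(displacement, unit, fapus):
    """Encode a tracked displacement as a speaker-independent FAP value."""
    return displacement / fapus[unit]


def fap_to_displacement(fap_value, unit, fapus):
    """Decode a FAP value into a displacement on a specific face model."""
    return fap_value * fapus[unit]


# Hypothetical neutral-face measurements for the two speakers.
source_fapus = fapus_from_neutral(300.0, 512.0, 600.0, 400.0)
target_fapus = fapus_from_neutral(320.0, 640.0, 650.0, 420.0)

# A lip-corner movement tracked on the source face, measured in MW units.
source_displacement = 50.0
fap = displacement_to_fap(source_displacement, "MW", source_fapus)

# The same FAP value, interpreted with the target's FAPUs.
target_displacement = fap_to_displacement(fap, "MW", target_fapus)
print(fap, target_displacement)
```

Because the FAP value is normalised by the source face's proportions and rescaled by the target's, the same expression produces a larger displacement on the wider target mouth in this example.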
