Speech-driven talking face using embedded confusable system for real-time mobile multimedia

This paper presents a real-time speech-driven talking face system that combines low computational complexity with smooth visual output. A novel embedded confusable system generates an efficient phoneme-viseme mapping table: viseme similarity is first estimated with a histogram distance, and phonemes are then grouped using the Houtgast similarity approach, exploiting the fact that many visemes are visually ambiguous. The resulting mapping table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time system comprises three stages: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature extraction for robust acoustic feature vectors; 2) recognition network processing, in which an HMM and a multi-class SVM (MCSVM) are combined for phoneme recognition and viseme classification; the HMM is well suited to sequential inputs, while the MCSVM classifies with good generalization, especially from limited samples, and the phoneme-viseme mapping table lets the MCSVM decide which viseme class each HMM observation sequence belongs to; 3) visual processing, which arranges the lip-shape images of the visemes in time sequence and adds realism through dynamic alpha blending with varying alpha settings. In the experiments, the speech signal processing stage, evaluated on noisy versus clean speech, yields improvements of 1.1 percentage points (16.7 % to 15.6 %) in phoneme error rate (PER) and 4.8 percentage points (30.4 % to 35.2 %) in word error rate (WER). For viseme classification, the error rate drops from 19.22 % to 9.37 %. Finally, we simulated GSM communication between a mobile phone and a PC and rated visual quality and the sense of speech-driven animation with mean opinion scores. By merging visemes into confusable sets, our method reduces the number of visemes and lip-shape images and enables real-time operation.
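The core idea of the confusable sets, grouping visemes whose lip-shape images are too similar to distinguish, can be sketched as below. This is a simplified illustration, not the paper's method: it uses an L1 histogram distance and a greedy threshold merge in place of the Houtgast similarity approach, and the `threshold` value is an assumed parameter.

```python
import numpy as np

def histogram_distance(h1, h2):
    """L1 distance between two normalized lip-shape intensity histograms
    (a stand-in for the paper's histogram-distance similarity estimate)."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return np.abs(h1 - h2).sum()

def group_confusable_visemes(histograms, threshold=0.2):
    """Greedily merge visemes whose histogram distance to a group's
    representative falls below the threshold, yielding confusable sets.
    Each group is a list of viseme indices; the first index is the
    representative whose lip-shape image is kept."""
    groups = []
    for idx, h in enumerate(histograms):
        placed = False
        for g in groups:
            rep = histograms[g[0]]
            if histogram_distance(h, rep) < threshold:
                g.append(idx)
                placed = True
                break
        if not placed:
            groups.append([idx])
    return groups
```

Merging visually ambiguous visemes this way is what shrinks both the mapping table and the lip-shape image inventory, which in turn lowers the classification and rendering cost on a mobile device.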
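Once the confusable sets are fixed, the phoneme-viseme mapping table reduces viseme selection to a table lookup over the recognized phoneme sequence. The sketch below replaces the paper's MCSVM stage with a plain dictionary lookup purely for illustration; the phoneme symbols, viseme class names, and the `V_neutral` fallback are all hypothetical, not taken from the paper.

```python
# Hypothetical phoneme-viseme mapping table built from confusable sets.
# Symbols and class names are illustrative only.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "a": "V_open", "i": "V_spread", "u": "V_rounded",
}

def phonemes_to_visemes(phonemes, table=PHONEME_TO_VISEME):
    """Map a recognized phoneme sequence to viseme classes, collapsing
    consecutive duplicates so each lip shape is rendered only once."""
    visemes = []
    for p in phonemes:
        v = table.get(p, "V_neutral")  # assumed fallback: neutral mouth shape
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```

For example, `phonemes_to_visemes(["p", "b", "a"])` collapses the two bilabial phonemes into a single `V_bilabial` entry followed by `V_open`, which is what makes the output sequence short enough to animate in real time.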
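The visual processing stage's dynamic alpha blending can be sketched as a cross-fade between consecutive lip-shape frames. This is a minimal sketch under assumptions: the linearly increasing alpha schedule and the `steps` parameter are illustrative choices, whereas the paper varies the alpha settings dynamically.

```python
import numpy as np

def blend_frames(frame_a, frame_b, steps=5):
    """Generate intermediate frames between two lip-shape images by
    alpha blending with a time-varying alpha, so the transition between
    visemes appears smooth rather than abrupt."""
    out = []
    for k in range(1, steps + 1):
        alpha = k / (steps + 1)  # assumed schedule: alpha rises linearly
        out.append(((1 - alpha) * frame_a + alpha * frame_b).astype(frame_a.dtype))
    return out
```

Inserting a handful of blended frames between each pair of viseme images trades a small amount of extra rendering work for a noticeably smoother mouth animation, which is what the mean opinion score evaluation rates.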
