Clustering Persian viseme using phoneme subspace for developing visual speech application

Numerous multimedia applications, such as talking heads, lip reading, lip synchronization, and computer-assisted pronunciation training, entice researchers to focus on clustering and analyzing visemes. Because clustering and analyzing visemes is a language-dependent process, we concentrated our research on Persian, a language that has suffered from the lack of such studies. To this end, we propose a novel image-based approach consisting of four main steps: (a) extracting the lip region, (b) obtaining the Eigenviseme of each phoneme while accounting for the coarticulation effect, (c) mapping each viseme into its own subspace and the other phonemes' subspaces to build a distance matrix from which distances between viseme clusters are computed, and (d) comparing the similarity of visemes based on the weight values of their reconstructions. To demonstrate the robustness of the proposed algorithm, three sets of experiments were conducted on Persian and English databases in which Consonant/Vowel and Consonant/Vowel/Consonant syllables were examined. The results indicate that the proposed method outperforms the examined state-of-the-art feature-extraction algorithms and achieves comparable efficiency in generating adequate clusters. Moreover, the obtained results mark a milestone in grouping Persian visemes with respect to a perceptual test administered to volunteers.
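Steps (b)–(d) above can be sketched with a standard eigenface-style PCA formulation: each phoneme's lip frames define a subspace, and a viseme that is reconstructed well by another phoneme's subspace is visually similar to that phoneme. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function names and the synthetic data are assumptions for demonstration only.

```python
import numpy as np

def eigenviseme_basis(images, k=3):
    """Build a PCA (eigenface-style) subspace from flattened lip frames.

    images: (n_samples, n_pixels) array of lip-region frames for one phoneme.
    Returns the mean frame and the top-k principal directions.
    """
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD of the centered data yields the principal axes directly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(image, mean, basis):
    """Project a frame into a phoneme subspace and measure what is lost.

    The residual norm serves as a distance: small error means the frame
    is well explained by (visually similar to) that phoneme's subspace.
    """
    coeffs = basis @ (image - mean)      # subspace weight values
    recon = mean + basis.T @ coeffs      # reconstruction from the weights
    return float(np.linalg.norm(image - recon))

# Toy demo: two synthetic "phoneme" frame sets with distinct structure.
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 64)); a[:, :8] += 5.0    # phoneme A pattern
b = rng.normal(size=(20, 64)); b[:, -8:] += 5.0   # phoneme B pattern

mean_a, basis_a = eigenviseme_basis(a)
mean_b, basis_b = eigenviseme_basis(b)

probe = a[0]
d_self = reconstruction_error(probe, mean_a, basis_a)
d_cross = reconstruction_error(probe, mean_b, basis_b)
# An A-frame should be reconstructed better by A's own subspace.
print(d_self, d_cross)
```

Filling a matrix with such cross-subspace errors for every phoneme pair gives the distance matrix of step (c), to which an agglomerative method can then be applied to merge phonemes into viseme clusters.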
