Toward Clustering Persian Vowel Viseme: A New Clustering Approach based on HMM

This paper addresses the problem of Persian vowel viseme clustering. Clustering audio-visual data has been studied for a decade or so; however, it remains an open problem due to the shortage of appropriate data and the dependency of the task on the target language. Here, we propose a speaker-independent and robust method for Persian viseme class identification as our main contribution. The overall process of the proposed method consists of three main steps: (I) mouth region segmentation, (II) feature extraction, and (III) hierarchical clustering. After segmenting the mouth region in all frames, feature vectors are extracted based on a new look at the Hidden Markov Model (HMM). This is another contribution of this work, which utilizes the HMM as a probabilistic model-based feature detector. Finally, a hierarchical clustering approach is applied to cluster the Persian vowel visemes. The main advantage of this work over others is that it produces a single clustering output for all subjects, which can simplify the research process in other applications. To demonstrate the efficiency of the proposed method, a set of experiments is conducted on AVA II.
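The HMM-as-feature-detector idea followed by hierarchical clustering can be sketched as follows. In this toy illustration (not the paper's actual models or data), each observation sequence is scored by the forward algorithm under a set of candidate HMMs, the resulting log-likelihood vectors serve as features, and naive single-linkage agglomerative clustering groups the sequences. All HMM parameters, sequences, and names here are illustrative placeholders.

```python
import math

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm."""
    n = len(pi)
    alpha = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [
            math.log(sum(math.exp(alpha[j]) * A[j][i] for j in range(n)))
            + math.log(B[i][obs[t]])
            for i in range(n)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def single_linkage(points, n_clusters):
    """Naive agglomerative clustering with single linkage (Euclidean)."""
    clusters = [[i] for i in range(len(points))]
    def dist(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Two toy 2-state HMMs (pi, A, B) with contrasting emission profiles,
# standing in for viseme-class models; purely hypothetical parameters.
hmm_a = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.7, 0.3]])
hmm_b = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.3, 0.7]])
models = [hmm_a, hmm_b]

# Toy discretized "lip-shape" sequences; each one is mapped to a vector
# of log-likelihoods under every candidate HMM (the feature vector).
seqs = [[0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 1, 0], [1, 1, 0, 1]]
feats = [[forward_log_likelihood(s, *m) for m in models] for s in seqs]
clusters = single_linkage(feats, 2)
```

With these parameters the first two sequences group together and so do the last two, since their likelihood profiles under the two candidate models are similar. The same score-then-cluster structure extends to real lip-region feature sequences with trained HMMs.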
