Cross-Modal Analysis Between Phonation Differences and Texture Images Based on Sentiment Correlations

Motivated by the success of representing speech characteristics with color attributes, we analyzed the cross-modal sentiment correlations between voice source characteristics and textural image characteristics. For the analysis, we employed vowel sounds with three representative phonation types (modal, creaky and breathy) and 36 texture images, each annotated with one of 36 semantic attributes (e.g., banded, cracked and scaly). We asked 40 subjects to select the best-fitting textures from the 36 texture images after listening to 30 speech samples with different phonations. We then measured the correlations between acoustic parameters reflecting voice source variations and textural parameters of the selected images describing coarseness, contrast, directionality, busyness, complexity and strength. The texture classifications show that voice characteristics can be roughly characterized by textural differences: modal by gauzy, banded and smeared; creaky by porous, crystalline, cracked and scaly; breathy by smeared, freckled and stained. We also found significant correlations between voice source acoustic parameters and textural parameters. These correlations suggest the possibility of a cross-modal mapping between voice source characteristics and textural parameters, which would enable visualization of speech information whose source variations reflect human sentiment perception.
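The correlation measurement described above can be illustrated with a minimal sketch. The abstract does not state which correlation statistic was used, so the Pearson coefficient here is an assumption, as are the parameter names (`h1_h2_db` for a voice source measure, `coarseness` for a textural measure). The numeric values are hypothetical placeholders, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder values (NOT data from the study):
# one acoustic value per speech sample, paired with one textural value
# for the texture image selected for that sample.
h1_h2_db = [1.0, 2.5, 4.0, 6.5, 8.0]
coarseness = [0.9, 0.8, 0.6, 0.5, 0.2]

r = pearson_r(h1_h2_db, coarseness)
print(f"r = {r:.3f}")
```

In practice one such coefficient would be computed for each pairing of an acoustic parameter with one of the six textural parameters, together with a significance test that this sketch omits.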
