Speaker trait characterization in web videos: Uniting speech, language, and facial features

We present a multimodal approach to speaker characterization using acoustic, visual, and linguistic features. To keep the setting realistic, we evaluate on a database of real-life web videos and extract all features automatically, including face and eye detection and automatic speech recognition. Different segmentations are evaluated for the audio and video streams, and the statistical relevance of Linguistic Inquiry and Word Count (LIWC) features is confirmed. Late multimodal fusion delivers 73%, 92%, and 73% average recall in binary age, gender, and race classification, respectively, on unseen test subjects, outperforming the best single modality for age and race.
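
As an illustration of the two concepts the results rest on, the following minimal sketch shows one common form of late fusion (a weighted average of per-modality classifier scores) and the computation of average recall as the mean of per-class recalls. It is not the fusion scheme used in this work, which the abstract does not specify. The function names, the equal weights, the 0.5 decision threshold, and the toy scores are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence


def late_fusion(
    scores_by_modality: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Fuse per-modality positive-class scores by a weighted average.

    Hypothetical helper: each modality is assumed to supply one score
    in [0, 1] per test subject, in the same subject order.
    """
    names = list(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal weights
    total = sum(weights[name] for name in names)
    n_subjects = len(scores_by_modality[names[0]])
    return [
        sum(weights[name] * scores_by_modality[name][i] for name in names) / total
        for i in range(n_subjects)
    ]


def average_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy scores for four subjects (invented for illustration only).
scores = {
    "audio": [0.9, 0.4, 0.3, 0.6],
    "video": [0.7, 0.7, 0.2, 0.3],
    "text": [0.6, 0.5, 0.4, 0.2],
}
y_true = [1, 1, 0, 0]

fused = late_fusion(scores)
fused_pred = [1 if s >= 0.5 else 0 for s in fused]          # assumed threshold
audio_pred = [1 if s >= 0.5 else 0 for s in scores["audio"]]

print("audio only:", average_recall(y_true, audio_pred))
print("late fusion:", average_recall(y_true, fused_pred))
```

In this toy case the audio classifier misjudges two subjects that the other modalities score correctly, so averaging the scores recovers them. Whether fusion helps in practice depends on how complementary the modalities' errors are, which is consistent with the reported gains for age and race but not for gender.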
