Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

This work investigates the effect of children's age on pragmatic skills, i.e. on the way children participate in conversations, in particular when it comes to turn management (who talks when and how much) and the use of silences and pauses. The proposed approach combines the extraction of "Steady Conversational Periods", time intervals during which the structure of a conversation is stable, with Observed Influence Models, Generative Score Spaces, and feature selection strategies. The experiments involve 76 children split into two age groups: "pre-School" (3-4 years) and "School" (6-8 years). The statistical approach proposed in this work predicts the group each child belongs to with precision up to 85%. Furthermore, it identifies the pragmatic skills that best account for the difference between the two groups.
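The pipeline the abstract describes — extract turn-management features from a conversation, then classify each child into an age group — can be illustrated with a minimal sketch. This is not the authors' implementation (which uses Observed Influence Models and Generative Score Spaces): the features, the toy conversation, the centroid values, and the nearest-centroid rule are all hypothetical stand-ins chosen only to show the feature-extraction-then-classification shape of the approach.

```python
# Hypothetical sketch: turn-management features + nearest-centroid
# classification into "pre-School" vs "School". All numbers are invented.

def turn_features(turns):
    """turns: list of (speaker, duration_s) with speaker 'C' = child,
    'A' = adult, None = silence. Returns a 3-tuple of features:
    (child speaking fraction, mean child turn length, silence fraction)."""
    total = sum(d for _, d in turns)
    child = [d for s, d in turns if s == "C"]
    silence = sum(d for s, d in turns if s is None)
    return (sum(child) / total,
            sum(child) / len(child) if child else 0.0,
            silence / total)

def nearest_centroid(x, centroids):
    """Assign x the label of the closest centroid (squared Euclidean)."""
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2
                                   for a, b in zip(x, centroids[lbl])))

# Invented centroids standing in for statistics learned from training data:
# younger children tend toward shorter turns and more silence.
centroids = {
    "pre-School": (0.25, 1.0, 0.30),
    "School":     (0.45, 2.5, 0.10),
}

conversation = [("A", 3.0), ("C", 2.5), (None, 0.4), ("C", 2.0), ("A", 1.5)]
x = turn_features(conversation)
print(nearest_centroid(x, centroids))  # prints "School"
```

In the paper itself the features come from Steady Conversational Periods and the scores from generative models rather than hand-picked centroids, but the overall flow (per-conversation features, then a discriminative decision between the two age groups) is the same.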
