Automatic Detection and Classification of Head Movements in Face-to-Face Conversations

This paper presents an approach to automatic head movement detection and classification in data from a corpus of video-recorded face-to-face conversations in Danish involving 12 different speakers. A number of classifiers were trained with different combinations of visual, acoustic, and word features and tested in a leave-one-out cross-validation scenario. The visual movement features were extracted from the raw video data using OpenPose, and the acoustic ones using Praat. The best results were obtained by a Multilayer Perceptron classifier, which reached an average F1 score of 0.68 across the 12 speakers for head movement detection, and 0.40 for head movement classification over four classes. In both cases, the classifier outperformed both a simple most-frequent-class baseline and a more advanced baseline relying only on velocity features.
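The paper's exact features and hyperparameters are not given in the abstract; a minimal sketch of the evaluation protocol it describes (leave-one-speaker-out cross-validation of a Multilayer Perceptron against a most-frequent-class baseline, scored with F1) might look as follows, assuming scikit-learn and synthetic stand-in features:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: per-frame feature vectors (e.g. head keypoint
# velocities, pitch) and binary movement/no-movement labels for each of
# 12 hypothetical speakers. Real features would come from OpenPose/Praat.
n_speakers = 12
X = {s: rng.normal(size=(200, 6)) for s in range(n_speakers)}
y = {s: rng.integers(0, 2, size=200) for s in range(n_speakers)}

mlp_f1, baseline_f1 = [], []
for held_out in range(n_speakers):
    # Leave-one-speaker-out: train on 11 speakers, test on the 12th.
    X_train = np.vstack([X[s] for s in range(n_speakers) if s != held_out])
    y_train = np.concatenate([y[s] for s in range(n_speakers) if s != held_out])

    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)
    mlp_f1.append(f1_score(y[held_out], clf.predict(X[held_out])))

    # Most-frequent-class baseline for comparison.
    dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    baseline_f1.append(f1_score(y[held_out], dummy.predict(X[held_out])))

print(f"MLP mean F1:      {np.mean(mlp_f1):.2f}")
print(f"Baseline mean F1: {np.mean(baseline_f1):.2f}")
```

With the paper's actual features, the reported averages were 0.68 (detection) for the MLP against the weaker baselines; holding out one speaker at a time tests whether the model generalizes to unseen speakers rather than memorizing speaker-specific movement styles.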
