The quantification of gesture–speech synchrony: A tutorial and validation of multimodal data acquisition using device-based and video-based motion tracking

There is increasing evidence that hand gestures and speech synchronize their activity on multiple dimensions and timescales. For example, gesture’s kinematic peaks (e.g., maximum speed) are coupled with prosodic markers in speech. Such coupling operates on very short timescales, at the level of syllables (200 ms), and therefore requires high-resolution measurement of gesture kinematics and speech acoustics. High-resolution speech analysis is common in gesture studies, given the field’s classic ties with (psycho)linguistics. However, the field has lagged behind in the objective study of gesture kinematics (e.g., as compared with research on instrumental action). Kinematic peaks in gesture are often measured by eye, with a “moment of maximum effort” determined by several raters. In the present article, we provide a tutorial on more efficient methods for quantifying the temporal properties of gesture kinematics, focusing on common challenges and possible solutions that come with the complexities of studying multimodal language. We further introduce and compare, using an actual gesture dataset (392 gesture events), the performance of two video-based motion-tracking methods (deep learning vs. pixel change) against a high-performance wired motion-tracking system (Polhemus Liberty). We show that the videography methods perform well in the temporal estimation of kinematic peaks, and thus provide an inexpensive alternative to costly motion-tracking systems. We hope that the present article inspires gesture researchers to embark on the widespread objective study of gesture kinematics and their relation to speech.
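To illustrate the kind of automated peak estimation the abstract refers to, the sketch below locates the time of maximum speed from tracked position data. It is a minimal Python example, not the article’s exact pipeline: the function name, the Savitzky–Golay smoothing settings, and the assumed input format (an n-by-d array of 2D or 3D positions at a known sampling rate) are all illustrative choices.

```python
import numpy as np
from scipy.signal import savgol_filter

def peak_speed_time(position, fs):
    """Return (peak_time_s, peak_speed) for a position time series.

    position : (n, d) array of coordinates (d = 2 or 3) in consistent units.
    fs       : sampling rate in Hz.
    """
    # Frame-to-frame displacement, converted to speed (units per second).
    speed = np.linalg.norm(np.diff(position, axis=0), axis=1) * fs
    # Light smoothing before peak picking; the window (in samples)
    # must be odd and should be tuned to the sampling rate.
    speed = savgol_filter(speed, window_length=9, polyorder=2)
    i = int(np.argmax(speed))
    return i / fs, float(speed[i])
```

The same routine applies regardless of the data source: wired-tracker coordinates (e.g., a Polhemus sensor) or video-based keypoints (e.g., a tracked wrist from a pose-estimation tool) can both be passed in as the position array.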

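The pixel-change approach mentioned in the abstract requires no tracking model at all: movement magnitude is approximated by how much the image changes between consecutive video frames. Below is a minimal sketch of that frame-differencing idea using OpenCV; the grayscale conversion and summed absolute differences are illustrative assumptions rather than the article’s exact implementation.

```python
import cv2
import numpy as np

def pixel_change_series(video_path):
    """Return the summed absolute grayscale change per frame pair."""
    cap = cv2.VideoCapture(video_path)
    changes, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # int16 avoids uint8 wraparound when subtracting frames.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
        if prev is not None:
            # Total absolute pixel difference between consecutive frames
            # serves as a coarse, trackerless movement signal.
            changes.append(int(np.abs(gray - prev).sum()))
        prev = gray
    cap.release()
    return np.array(changes)
```

The resulting per-frame series can then be smoothed and aligned with speech acoustics (e.g., the amplitude envelope) for the kind of gesture–speech synchrony analyses discussed above.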