Annotation and taxonomy of gestures in lecture videos

Human arm and body gestures have long been recognized as significant in communication, particularly in teaching. We gather ground-truth annotations of gesture appearance using a 27-bit pose vector. We manually annotate and analyze the gestures of two instructors, each in a 75-minute computer science lecture recorded on digital video, finding 866 gestures and identifying 126 fine equivalence classes that can be further clustered into 9 semantic classes. These classes encompass “pedagogical” gestures of punctuation and encouragement, as well as traditional classes such as deictic and metaphoric. We note that gestures appear to be both highly idiosyncratic and highly repetitive. We introduce a tool that facilitates the manual annotation of gestures in video, and present initial results on gesture frequencies and co-occurrences; in particular, we find that pointing (deictic) and “spreading” (pedagogical) gestures predominate, and that 5 poses account for 80% of the variation in the annotated ground truth.
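As a minimal sketch of the kind of representation described above, one could pack 27 binary pose attributes into a single integer code and group identical codes into equivalence classes; the bit layout and function names here are hypothetical, since the abstract does not specify which attribute occupies which bit:

```python
# Hypothetical sketch: represent each annotated gesture as a 27-bit pose
# vector packed into an integer, then measure how much of the annotated
# data the most frequent pose codes cover (cf. "5 poses represent 80%").
# The bit layout is invented for illustration only.
from collections import Counter


def encode_pose(bits):
    """Pack a sequence of 27 binary pose attributes into one integer code."""
    assert len(bits) == 27
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def top_k_coverage(pose_codes, k):
    """Fraction of annotations covered by the k most frequent pose codes."""
    counts = Counter(pose_codes)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(pose_codes)
```

Grouping annotations by identical pose code yields the fine equivalence classes; coarser semantic classes would require merging codes by hand or by a learned similarity, which this sketch does not attempt.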
