Learning Deep Spatiotemporal Feature for Engagement Recognition of Online Courses

This paper studies engagement recognition in online courses from students' appearance and behavioral information using deep learning methods. Automatic engagement recognition can be applied to developing effective online instructional and assessment strategies that promote learning. We make two contributions. First, we propose a Convolutional 3D (C3D) neural network-based approach to automatic engagement recognition that models both the appearance and motion information in videos and recognizes student engagement automatically. Second, we introduce Focal Loss to address the class-imbalanced data distribution in engagement recognition: it adaptively decreases the weight of high-engagement samples and increases the weight of low-engagement samples during deep spatiotemporal feature learning. Experiments on the DAiSEE dataset show the effectiveness of our method in comparison with state-of-the-art automatic engagement recognition methods.
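To make the architecture concrete, the following is a minimal PyTorch sketch of a C3D-style network for clip-level engagement classification. It is an illustrative assumption, not the paper's exact configuration: the class name, layer widths, and clip size (16 frames at 112x112, following the original C3D design) are chosen for the example, and num_classes=4 reflects DAiSEE's four engagement levels.

```python
import torch
import torch.nn as nn

class C3DEngagement(nn.Module):
    """Illustrative C3D-style network: stacked 3x3x3 convolutions learn
    spatiotemporal (appearance + motion) features from a video clip.
    Layer sizes are assumptions for this sketch, not the paper's exact model."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only, keep early temporal detail
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),           # pool time and space together
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),               # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):                    # x: (N, 3, T, H, W) RGB clip
        h = self.features(x).flatten(1)      # (N, 256) clip-level feature
        return self.classifier(h)            # (N, num_classes) logits

# Example: a batch of two 16-frame 112x112 clips.
logits = C3DEngagement()(torch.randn(2, 3, 16, 112, 112))
```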

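Focal Loss reshapes the cross-entropy loss as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), so well-classified samples (in DAiSEE, predominantly the high-engagement ones) contribute less to the gradient while hard, rare low-engagement samples contribute more. A minimal multi-class sketch in PyTorch, with the function name and the default gamma=2 assumed for illustration:

```python
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=None, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  (N, C) raw class scores
    targets: (N,)   integer engagement labels
    alpha:   optional (C,) per-class weights, e.g. larger for the rare
             low-engagement classes; gamma down-weights easy samples
    """
    log_p = F.log_softmax(logits, dim=-1)                       # (N, C) log-probabilities
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # (N,) log p_t of true class
    pt = log_pt.exp()                                           # (N,) p_t per sample
    loss = -((1.0 - pt) ** gamma) * log_pt                      # modulated cross-entropy
    if alpha is not None:
        loss = alpha[targets] * loss                            # apply alpha_t per sample
    return loss.mean()
```

With gamma=0 and alpha=None this reduces to ordinary cross-entropy, which serves as a quick sanity check when tuning.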